Contents
1. Figure 9-25. Filtering .csv files
Click on OK, and then click the Binary link under the Content column in Figure 9-25. Doing so will load the data from the .csv file into the Query Editor. You can rename the columns to more meaningful names before importing them into the Excel worksheet, as illustrated in Figure 9-26.
[Figure 9-26. Formatting the data — the Query Editor preview ("Query1, Preview downloaded at 7:24 AM") showing AAPL rows with Symbol, Date, OpenPrice, HighPrice, LowPrice, close, and volume columns for dates from 18/07/2013 through 5/8/2013.]
Click on Apply and Close, and the data will be imported to Excel. You can see the total nu…
2. DoMapReduce();
DoHiveOperations();
MonitorCluster();
Console.Write("Press any key to exit");
Console.ReadKey();

// List existing HDI clusters
public static void ListClusters()
{
    var store = new X509Store();
    store.Open(OpenFlags.ReadOnly);
    var cert = store.Certificates.Cast<X509Certificate2>().First(item => item.Thumbprint == Constants.thumbprint);
    var creds = new HDInsightCertificateCredential(Constants.subscriptionId, cert);
    var client = HDInsightClient.Connect(creds);
    var clusters = client.ListClusters();
    Console.WriteLine("The list of clusters and their details are:");
    foreach (var item in clusters)
    {
        Console.WriteLine("Cluster: {0}, Nodes: {1}, State: {2}, Version: {3}",
            item.Name, item.ClusterSizeInNodes, item.State, item.Version);
    }
}

// Create a new HDI cluster
public static void CreateCluster()
{
    var store = new X509Store();
    store.Open(OpenFlags.ReadOnly);
    var cert = store.Certificates.Cast<X509Certificate2>().First(item => item.Thumbprint == Constants.thumbprint);
    var creds = new HDInsightCertificateCredential(Constants.subscriptionId, cert);
    var client = HDInsightClient.Connect(creds);

    // Cluster information
    var clusterInfo = new ClusterCreateParameters()
    {
        Name = "AutomatedHDICluster",
        Location = "North Europe",
        DefaultStorageAccountName = Constants.storageAccount,
        DefaultStorageAccountKey = Constants.storageAccountKey,
        De…
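The excerpt above breaks off inside the ClusterCreateParameters initializer. As a hedged sketch only (the remaining property assignments are elided in the source), a listing of this shape typically concludes with the SDK call that actually submits the provisioning request:

    // Sketch: provision the cluster synchronously once clusterInfo is fully populated
    var clusterDetails = client.CreateCluster(clusterInfo);
    Console.WriteLine("Created cluster: {0}", clusterDetails.Name);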
3. To create a Power View report based on the PowerPivot data model created earlier, open the workbook with the PowerPivot model, click on the Insert ribbon in Excel, and select Power View, as shown in Figure 9-21.
[Figure 9-21. Launching Power View for Excel — the Insert ribbon (Home, Insert, Page Layout, Formulas, Data, Review, View, Power Query, PowerPivot) with the Power View button ("Insert a Power View Report") highlighted.]
This launches a new Power View window with the PowerPivot model already available to it. With the Power View window open, you are now ready to create a report, chart, or other visualization. Here are the steps to follow:
1. Once Power View opens, click on the chart to select and highlight it. Drag and drop Average of stock_price_close into the fields section.
2. Click the Line Chart graph in the Design ribbon to switch to the chart, and expand the graph to fit it to the canvas.
3. Change the title to Stock Comparison.
4. Drag Hdate to the Filters field in the report.
5. Drag exchange to the Tile By column.
6. Drag FullDateAlternateKey to Axis.
7. Drag stock_symbol to Legend.
4. [Figure 6-17. Apache Hadoop Windows services — the Services console listing Apache Hadoop derbyserver, hiveserver, hiveserver2, isotopejs, jobtracker, metastore, namenode, oozieservice, and templeton, mostly Started with Manual startup under the hdp account.]
These services are unique to Hadoop on Windows, and Table 6-2 summarizes the function of each service.

Table 6-2. The Hadoop Windows services
Service — Function
Apache Hadoop derbyserver — Runs the service for Hive's native embedded database technology, called Derby.
Apache Hadoop hiveserver — Simulates Hive's native Thrift service for remote client connectivity.
Apache Hadoop hiveserver2 — Same as hiveserver, with support for concurrency, for ODBC and JDBC.
Apache Hadoop isotopejs — Runs the required handlers for the interactive consoles that are available on the HDInsight management portal.
Apache Hadoop jobtracker — Runs the Hadoop JobTracker service.
Apache Hadoop metastore — Runs the Hive/Oozie metastore services.
Apache Hadoop namenode — Runs the Hadoop NameNode service.
Apache Hadoop oozieservice — Runs the Oozie service.
Apache Hadoop templeton — Runs th…
5. Troubleshooting Visual Studio Deployments
As described in Chapter 4, you can use the Hadoop .NET SDK classes to programmatically deploy your HDInsight clusters through Microsoft Visual Studio projects. The Visual Studio IDE gives you a couple of great ways to debug your application when some operation throws errors or does not produce the desired output.

Using Breakpoints
A breakpoint is a special marker in your code that is active when executing the program under the Visual Studio debugger. When the marker is reached, it causes the program to pause, changing the execution mode to break mode. You can then step through the code line by line using the Visual Studio debugging tools while monitoring the contents of local and watched variables. You can set a breakpoint on a particular line from the Debug menu of Visual Studio or by simply pressing the function key F9. Figure 12-3 shows a sample scenario in your HadoopClient solution where a breakpoint is hit and you can examine your variable values:

public static void CreateCluster()
{
    var store = new X509Store();
    store.Open(OpenFlags.ReadOnly);
    var cert = store.Certificates.Cast<X509Certificate2>().First(item => item.Thumbprint == Constants.thumbprint);
    var client = new ClusterProvisioningClient(Constants.subscriptionId, cert);

    // Cluster information
    var clusterDetails = new HDInsightClusterCreationDetails…
6. var cert = store.Certificates.Cast<X509Certificate2>().First(item => item.Thumbprint == Constants.thumbprint);
var creds = new HDInsightCertificateCredential(Constants.subscriptionId, cert);
var client = HDInsightClient.Connect(creds);
client.DeleteCluster("AutomatedHDICluster");
ListClusters();

After executing the DeleteCluster method, you can go back to the Azure portal and confirm that the AutomatedHDICluster, which we just provisioned through code, no longer exists. You see only the two clusters that were previously created, as shown in Figure 4-8.
[Figure 4-8. AutomatedHDICluster is deleted — the portal now lists only the datadork and democluster clusters, both Running in East US.]
Using the HDInsight management package, you can easily list, create, and delete your HDInsight clusters on Azure. Add a call to the functions we added earlier inside the Main method and call them sequentially to view the output in the console window. The complete code listing for the Program.cs file, along with the Main method, is provided in Listing 4-5.

Listing 4-5. The Complete Code
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Security.Cryptography.X509Certificates;
using Microsoft.WindowsAzure.Management.HDInsight;

namespace HadoopClient
{
    class Program
    {
        static void Mai…
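Listing 4-5 is cut off at the Main method. Based on the sequence described in the excerpt (list, create, delete, then list again), a minimal sketch of how Main might wire the helpers together follows; the DeleteCluster wrapper name is assumed from the surrounding text, not copied from the book:

static void Main(string[] args)
{
    ListClusters();     // enumerate the existing clusters
    CreateCluster();    // provision AutomatedHDICluster
    DeleteCluster();    // remove it again and re-list to confirm
    Console.Write("Press any key to exit");
    Console.ReadKey();
}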
7. Loading Data
You can feed data to your Hive tables by simply copying data files into the appropriate folders. A table's definition is purely a metadata schema that is applied to the data files in the folders when they are queried. This makes it easy to define tables in Hive for data that is generated by other processes and deposited in the appropriate folders when ready. Additionally, you can use the HiveQL LOAD statement to load data from an existing file into a Hive table. This statement moves the file from its current location to the folder associated with the table. LOAD does not do any transformation while loading data into tables; LOAD operations are currently pure copy/move operations that move data files into locations corresponding to Hive tables. This is useful when you need to create a table from the results of a MapReduce job or Pig script that generates an output file alongside log and status files. The technique enables you to easily add the output data to a table without having to deal with additional files you do not want to include in the table. For example, Listing 8-6 shows how to load data into the analysis stock table created earlier. You can execute the following PowerShell script, which will load data from TableMSFT.csv.

Listing 8-6. Loading data to a Hive table
$subscriptionName = "YourSubscriptionName"
$storageAccountName = "democluster"
$containerName = "democlustercontainer"
$clustername = "democluster"
$querystr…
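The query string being built in Listing 8-6 is elided above. For illustration, the HiveQL LOAD statement it would carry can also be submitted through the .NET SDK pattern shown in Chapter 5; the container, account, file, and table names below are placeholders drawn from this excerpt, not the book's exact listing:

// Hypothetical sketch: wrap a LOAD statement in a Hive job definition
var loadJob = new HiveJobCreateParameters()
{
    JobName = "Load stock data",
    StatusFolder = "/LoadStatusFolder",
    Query = "LOAD DATA INPATH 'wasb://democlustercontainer@democluster.blob.core.windows.net/TableMSFT.csv' " +
            "INTO TABLE stock_analysis"
};
var jobClient = JobSubmissionClientFactory.Connect(
    new JobSubmissionCertificateCredential(Constants.subscriptionId, cert, Constants.clusterName));
var jobResults = jobClient.CreateHiveJob(loadJob);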
8. Once you start up the Hadoop services using the start-onebox.cmd file, you see output similar to Listing 7-2 in the console.

Listing 7-2. start-onebox.cmd
c:\Hadoop>start-onebox.cmd
Starting Hadoop Core services
Starting Hadoop services
Starting namenode
The Apache Hadoop namenode service is starting.
The Apache Hadoop namenode service was started successfully.
Starting datanode
The Apache Hadoop datanode service is starting.
The Apache Hadoop datanode service was started successfully.
Starting secondarynamenode
The Apache Hadoop secondarynamenode service is starting.
The Apache Hadoop secondarynamenode service was started successfully.
Starting jobtracker
The Apache Hadoop jobtracker service is starting.
The Apache Hadoop jobtracker service was started successfully.
Starting tasktracker
The Apache Hadoop tasktracker service is starting.
The Apache Hadoop tasktracker service was started successfully.
Starting historyserver
The Apache Hadoop historyserver service is starting.
The Apache Hadoop historyserver service was started successfully.
Starting Hive services
Starting hwi
The Apache Hadoop hwi service is starting.
The Apache Hadoop hwi service was started successfully.
Starting derbyserver
The Apache Hadoop derbyserver service is starting.
The Apache Hadoop derbyserver service was started successfully.
Starting metastore
The Apache Hadoop met…
9. [Figure 10-13. Preview Hive query results — the preview grid showing stock_date and stock_price columns, with rows running from 5/8/2013 (464.69) down through 16/07/2013 (426.52).]
Navigate to the Columns tab. Confirm that all columns from the source Hive table are detected and fetched, as shown in Figure 10-14.
[Figure 10-14. Hive table columns — the Connection Manager's Columns page mapping the available external columns (stock_symbol, stock_date, stock_price_open, stock_price_high, stock_price_low, stock_price_close, stock_volume, stock_price_adj_close, exchange) to identically named output columns.]

Creating the SQL Destination Component
After the source is configured, you need to configure the destination where you want to import the Hive data. In this example, I use SQL Server as the destination. To do this, double-click on the OLE DB Destination component in the Toolbox and place an OLE DB Destination component on th…
10. 13/12/10 01:05:45 INFO mapred.JobClient: Running job: job_201311240635_0197
13/12/10 01:05:45 INFO mapred.JobClient:  map 0% reduce 0%
13/12/10 01:05:45 INFO mapred.JobClient:  map 100% reduce 0%
13/12/10 01:05:45 INFO mapred.JobClient: Job complete: job_201311240635_0197
13/12/10 01:05:45 INFO mapred.JobClient: Counters: 19
13/12/10 01:05:45 INFO mapred.JobClient:   Job Counters
13/12/10 01:05:45 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=37452
13/12/10 01:05:45 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=…
13/12/10 01:05:45 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=…
13/12/10 01:05:45 INFO mapred.JobClient:     Launched map tasks=1
13/12/10 01:05:45 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
13/12/10 01:05:45 INFO mapred.JobClient:   File Output Format Counters
13/12/10 01:05:45 INFO mapred.JobClient:     Bytes Written=2148196
13/12/10 01:05:45 INFO mapred.JobClient:   FileSystemCounters
13/12/10 01:05:45 INFO mapred.JobClient:     FILE_BYTES_READ=770
13/12/10 01:05:45 INFO mapred.JobClient:     HDFS_BYTES_READ=87
13/12/10 01:05:45 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=76307
13/12/10 01:05:45 INFO mapred.JobClient:     WASB_BYTES_WRITTEN=2148196
13/12/10 01:05:45 INFO mapred.JobClient:   File Input Format Counters
13/12/10 01:05:45 INFO mapred.JobClient:     Bytes Read=0
13/12/10 01:05:45 INFO mapred.JobClient:   Map-Reduce Framework
13/12/10 01:05:45 INFO mapred.JobClient:     Map input records=36153
13/12/10 01:05:45 INFO mapred.JobClient:     Physical memory (bytes) snapshot=215248896
13/12/10 01:05:45 INFO mapred.JobClient:     Spilled Records=0
13/12/10 01:05:45 INFO mapred.JobClient:     CPU time spent (ms)=5452
13/12/10 01:05:45 INFO mapred.JobClient:     Total committed heap usage (bytes)=514523136
13/12/10 01:05:45 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=653586432
13/12/10 01:05:45 INFO mapred.JobClient:     Map output records=36153
13/12/10 01:05:45 INFO m…
11. …4: get_table : db=default tbl=HiveSampleTable
Again, the preceding log output is stripped for brevity, but you can see how the log emits useful information, such as several port numbers, the query that it fires to load the default tables, the number of worker threads, and much more. In the case of a Hive processing error, this log is the best place to look for further insight into the problem.

Note: A lot of documentation is available on Apache's site regarding the logging framework that Hadoop and its supporting projects implement. That information is not covered in depth in this chapter, which focuses on HDInsight-specific features.

Log4j Framework
There are a few key properties in the Log4j framework that will help you maintain your cluster storage more efficiently. If all the services are left logging every bit of detail in the log files, a busy Hadoop cluster can easily run you out of storage space, especially in scenarios where your name node runs most of the other services as well. Such logging configurations can be controlled using the Log4j properties file present in the conf directory for the projects. For example, Figure 11-4 shows the configuration file for my Hadoop cluster.
[Figure 11-4. The conf directory under C:\apps\dist\hadoop-1.2.0.1.3.1.0-06, with files such as capacity-sch…]
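As an illustration of the kind of properties involved, the snippet below follows standard Log4j conventions for reducing verbosity and capping log growth; the exact appender names and defaults in your conf directory may differ, so treat this as a hedged example rather than the file's actual contents:

# Lower the default verbosity and bound the size of rolling log files
hadoop.root.logger=WARN,console
log4j.appender.RFA=org.apache.log4j.RollingFileAppender
log4j.appender.RFA.MaxFileSize=1MB
log4j.appender.RFA.MaxBackupIndex=10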
12. CurrentDirectory to C:\Windows\SysWOW64
WINPKG: Current Directory: C:\Windows\SysWOW64
WINPKG: Package: C:\HadoopInstallFiles\HadoopSetupTools\HadoopPackages\hdp-1.0.1-winpkg.zip
WINPKG: Action: install
WINPKG: Action arguments:
WINPKG: Run-WinpkgAction C:\HadoopInstallFiles\HadoopSetupTools\HadoopPackages\hdp-1.0.1-winpkg.zip C:\HadoopInstallFiles\HadoopPackages install
WINPKG: UNZIP source: C:\HadoopInstallFiles\HadoopPackages\hdp-1.0.1-winpkg.zip
WINPKG: UNZIP destination: C:\HadoopInstallFiles\HadoopPackages
WINPKG: UNZIP unzipRoot: C:\HadoopInstallFiles\HadoopPackages\hdp-1.0.1-winpkg
WINPKG: Unzip of C:\HadoopInstallFiles\HadoopPackages\hdp-1.0.1-winpkg.zip to C:\HadoopInstallFiles\HadoopPackages succeeded
WINPKG: UnzipRoot: C:\HadoopInstallFiles\HadoopPackages\hdp-1.0.1-winpkg
WINPKG: C:\HadoopInstallFiles\HadoopPackages\hdp-1.0.1-winpkg\scripts\install.ps1
HDP: Logging to existing log C:\HadoopInstallFiles\HadoopSetupTools\hdp-1.0.1-winpkg.install.log
HDP: Logging to C:\HadoopInstallFiles\HadoopSetupTools\hdp-1.0.1-winpkg.install.log
HDP: HDP_INSTALL_PATH: C:\HadoopInstallFiles\HadoopPackages\hdp-1.0.1-winpkg\scripts
HDP: HDP_RESOURCES_DIR: C:\HadoopInstallFiles\HadoopPackages\hdp-1.0.1-winpkg\resources
HDP: INSTALLATION STARTED
HDP: Installing HDP version to c:\hadoop
HDP: Installing Java
HDP: Installing Java
HDP: Logging to existing log C:\Had…
13. Figure 9-19. Designing the PivotChart
You should be able to see a graphical summary of the closing price of the stocks of the companies over a period of time, as shown in Figure 9-20.
[Figure 9-20. The stock summary chart — a PivotChart of closing prices over time, with a legend including GOOG and ORCL and dates such as 9/27/2006.]
In the next section, you will see how you can use Power View to consume the PowerPivot data model and quickly create intelligent and interactive visualizations out of the stock market data.

Power View for Excel
Power View is a feature of Microsoft Excel 2013, and it's also a feature of Microsoft SharePoint 2013 as part of the SQL Server 2012 Service Pack 1 Reporting Services Add-in for Microsoft SharePoint Server 2013 Enterprise Edition. Power View in Excel 2013 and Power View in SharePoint 2013 both provide an interactive data exploration, visualization, and presentation experience for all skill levels, and they have similar features for designing Power View reports. This chapter shows a sample Power View report based on the stock_analysis table's data in Hive to give you a quick look at the powerful visualization features from the surface level. Details about how to design a Power View report, as well as details about Power View integration with SharePoint, are outside the scope of this book; neither topic is discussed in depth.

Note: Power View is supported only in Excel 2013…
14. In some cases, depending on your operating system and account security policies, you might need to unblock the downloaded cmdlets zip file to let it load into PowerShell. You can do it from the properties of the zip file, as shown in Figure 4-10.
[Figure 4-10. Unblock downloaded content — the file's Properties dialog (Attributes: Read-only, Hidden, Advanced; Security: "This file came from another computer and might be blocked to help protect this computer") with the Unblock button.]
Also, depending on your system's security configuration, you might need to set the PowerShell execution policy so that it can execute remotely signed assemblies. To do this, launch Windows Azure PowerShell as an administrator and execute the following command:

Set-ExecutionPolicy RemoteSigned

If you do not do this, and your security setting does not allow you to load a .dll file that is built and signed on a remote system, you will see error messages similar to the following in PowerShell while trying to import Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.dll:

Import-Module : The specified module 'D:\Microsoft.WindowsAzure.Management.HDInsight.Cmdlets\Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.dll' was not loaded because no valid module file was found in any module directory.

Once the cmdlet is successfully loaded, the first thing you need to do is associate the subscription id and the management certificate for your Azure subscription with the cmdlet variables. You can use the followin…
15. Once the design is complete, you should be able to see the Power View report comparing the different stock prices in a line chart. It is categorized based on the NASDAQ and NYSE, and it gives you a visualization of the stock prices with just a few clicks. Your Power View report should now look like Figure 9-22.
[Figure 9-22. The Stock Comparison report — a line chart of Average of stock_price_close by FullDateAlternateKey (2005-2010), tiled by exchange (NASDAQ shown, "Showing representative sample"), with AAPL, GOOG, and MSFT in the legend and the Power View field list (DimDate, stock_analysis: CalendarQuarter, CalendarYear, EnglishMonths, FullDateAlternateKey) alongside.]

Power BI: The Future
Power BI for Office provides users with powerful new ways to work with data in Excel from a variety of data sources. It lets you easily search, discover, and access data within and outside your organization and, with just a few clicks, shape, transform, analyze, and create stunning interactive visualizations out of the data. These visualizations uncover hidden insights you can share, and you can collaborate from anywhere, on any device. In this section, you will look at two offerings in the Power BI suite:
• Power Query
• Power Map
Power Query is a mash-up tool designed to…
16. PROVISIONING YOUR HDINSIGHT SERVICE CLUSTER
[Figure 3-1. Windows Azure Management Portal — the Windows Azure Subscriptions storage list showing accounts such as portalvhdsgl8czSbdj6b9k, debarchans, hadooponcloud, and portalvhdsvifr7jgdSvdg1, online in Southeast Asia and East US.]

Note: You might need to provide your Azure subscription credentials the first time you try to access the Management Portal.

Click on the NEW button in the lower left corner to bring up the NEW > DATA SERVICES > STORAGE window, as shown in Figure 3-2.
[Figure 3-2. New storage account — the QUICK CREATE option: "Create and manage storage accounts on Windows Azure for Blobs, Tables and Queues."]
Click on QUICK CREATE. Provide the storage account name and select the location of the data center region. If you have multiple subscriptions, you can also choose to select the one that gets billed according to your usage of the storage account. After providing all these details, your screen should look like Figure 3-3.
[Figure 3-3. The storage account creation screen — URL (…core.windows.net), LOCATION/AFFINITY GROUP (East US), and subscription fields filled in…]
17. You need to configure a connection to point to the SQL Server instance and the database table where you will import data from Hive. For this, you need to create a connection manager to the destination SQL, as you did for the source Hive. Right-click in the Connection Managers section of the project again, and this time choose New OLE DB Connection, as shown in Figure 10-9.
[Figure 10-9. Creating a new OLE DB connection to SQL Server — the Connection Managers context menu (New OLE DB Connection, New Flat File Connection, New ADO.NET Connection, New Analysis Services Connection, New File Connection, New Connection, plus Cut, Copy, Paste, Delete, Rename, Properties) with the existing Hive Connection listed.]
From the list of providers, select Native OLE DB > SQL Server Native Client 11.0. Type the name of the target SQL Server instance and select the database where the target table resides. The test connection should succeed, thereby confirming the validity of the connection manager for the destination, as shown in Figure 10-10.
[Figure 10-10. The Connection Manager dialog — server name ADENALIRTM, with "Use Windows Authentication" / "Use SQL Server Authentication" (user name, password, "Save my passwor…") options.]
18. The CLI can be installed in one of two ways:
• From the Node.js Package Manager (NPM), do the following:
  a. Navigate to www.nodejs.org.
  b. Click on Install and follow the instructions, accepting the default settings.
  c. Open a command prompt and execute the following command: npm install -g azure-cli
• From the Windows Installer, do the following:
  a. Navigate to http://www.windowsazure.com/en-us/downloads.
  b. Scroll down to the Command line tools section, then click Cross-platform Command Line Interface and follow the Web Platform Installer wizard instructions.
Once the installation is complete, you need to verify the installation. To do that, open the Windows Azure Command Prompt and execute the following command:

azure hdinsight -h

If the installation is successful, this command should display the help for all the HDInsight commands that are available in the CLI.

Note: If you get an error that the command is not found, make sure the path C:\Program Files (x86)\Microsoft SDKs\Windows Azure\CLI\wbin has been added to the PATH environment variable in the case of the Windows Installer. For NPM, make sure that the path C:\Program Files (x86)\nodejs;C:\Users\<username>\AppData\Roaming\npm is appended to the PATH variable.

Once it is installed, execute the following command to download and save the publishsettings file:

azure account download

You should se…
19. CREATE-USER: Create user completed
CREATE-USER: Adding user to the local group
CREATE-USER: Group HadoopUsers successfully created
CREATE-USER: User hadoop successfully added to HadoopUsers
HDP: Installing Hadoop Core
HDP: Setting HDFS_DATA_DIR to c:\hadoop\HDFS at machine scope
HDP: Invoke-Winpkg C:\HadoopInstallFiles\HadoopPackages\hdp-1.0.1-winpkg\resources\winpkg.ps1 C:\HadoopInstallFiles\HadoopPackages\hdp-1.0.1-winpkg\resources\hadoop-1.1.0-SNAPSHOT.winpkg.zip install -credentialFilePath c:\hadoop\singlenodecreds.xml -Verbose
WINPKG: Logging to existing log C:\HadoopInstallFiles\HadoopSetupTools\hdp-1.0.1-winpkg.install.log
WINPKG: ENV:WINPKG_BIN is C:\HadoopInstallFiles\HadoopPackages\hdp-1.0.1-winpkg\resources
WINPKG: Setting Environment CurrentDirectory to C:\HadoopInstallFiles\HadoopPackages\hdp-1.0.1-winpkg\scripts
WINPKG: Current Directory: C:\HadoopInstallFiles\HadoopPackages\hdp-1.0.1-winpkg\scripts
WINPKG: Package: C:\HadoopInstallFiles\HadoopPackages\hdp-1.0.1-winpkg\resources\hadoop-1.1.0-SNAPSHOT.winpkg.zip
WINPKG: Action: install
WINPKG: Action arguments: -credentialFilePath c:\hadoop\singlenodecreds.xml
WINPKG: Run-WinpkgAction C:\HadoopInstallFiles\HadoopPackages\hdp-1.0.1-winpkg\resources\hadoop-1.1.0-SNAPSHOT.winpkg.zip C:\HadoopInstallFiles\HadoopPackages\hdp-1.0.1-winpkg\resources install -credentialFilePath c:\hadoop\singlenodecreds.xml
WI…
20. [Figure 8-4. ODBC Data Source Administrator — the Drivers tab (alongside File DSN, Tracing, Connection Pooling, and About) listing the ODBC drivers installed on the system: the Hive ODBC driver (HIVEOD…, version 1.00.00.00), SQL Server (SQLSRV, 6.01.7601.17514), and SQL Server Native Client 11.0 (SQLNCL, 2011.110.3000.00), all from Microsoft Corporation.]

Note: There are two versions of the ODBC Data Source Administrator UI, one for 32-bit (%windir%\SysWOW64\odbcad32.exe) and one for 64-bit (%windir%\System32\odbcad32.exe). You'll likely want to create both 32-bit and 64-bit DSNs; just make sure that the same name is used for both versions. At a minimum, you'll need to register a 32-bit DSN to use when creating your SSIS package in the designer in Chapter 10.

The presence of the Microsoft Hive ODBC driver in the list of available ODBC drivers confirms that it has been installed successfully.

Testing the Driver
Once the driver is installed successfully, the next step is to ensure that you can make a connection to Hive using the driver. First, create a System DSN. In the ODBC Data Source Administrator, go to the System DSN tab and click on the Add button, as shown in Figure 8-5.
[Figure 8-5. The ODBC Data Source Admi…]
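Once the DSN exists, any ODBC-capable client can exercise the driver. As a quick sanity check outside of Excel or SSIS, a minimal C# console program along these lines will open the DSN and run a simple query; the DSN name "HiveDSN" is a placeholder for whatever name you register:

using System;
using System.Data.Odbc;

class HiveOdbcCheck
{
    static void Main()
    {
        // Opening the connection fails fast if the driver or DSN is misconfigured
        using (var connection = new OdbcConnection("DSN=HiveDSN"))
        {
            connection.Open();
            using (var command = new OdbcCommand("show tables", connection))
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine(reader.GetString(0));
            }
        }
    }
}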
21. …Hive. The latter is really interesting, because it builds on the established technology for .NET developers to access most data sources to deliver the capabilities of the de facto standard for Hadoop data query.

CHAPTER 5: Submitting Jobs to Your HDInsight Cluster
Apart from the cluster management operations you saw in the previous chapter, you can use the .NET SDK and the Windows PowerShell cmdlets to control job submission and execution in your HDInsight cluster. The jobs are typically MapReduce jobs, because that is the only thing that Hadoop understands. You can write your MapReduce jobs in .NET and also use supporting projects such as Hive, Pig, and so forth to avoid coding MapReduce programs, which can often be tedious and time consuming. In all the samples I have shown so far, I used the command-line consoles. However, this does not need to be the case; you can also use PowerShell. The console application that is used to submit the MapReduce jobs calls a .NET submissions API. As such, one can call the .NET API directly from within PowerShell, similar to the cluster management operations. You will use the same console application you created in the previous chapter and add the functions for job submissions. In this chapter, you will learn how to implement a custom MapReduce program in .NET and execute it as a Hadoop job. You will also take a look at how to execute the sample wordcount MapReduce job and a Hive query using .NET a…
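To make the .NET route concrete before the chapter's own walkthrough, here is a hedged sketch of submitting the sample wordcount jar through the SDK's job submission client; the wasb:/// paths match the PowerShell variables used later in the chapter, while the class shape and remaining names are illustrative, not the book's exact listing:

// Sketch: submit the bundled wordcount example as a MapReduce job
var wordCountJob = new MapReduceJobCreateParameters()
{
    JarFile = "wasb:///example/jars/hadoop-examples.jar",
    ClassName = "wordcount"
};
wordCountJob.Arguments.Add("wasb:///example/data/gutenberg/davinci.txt");  // input file
wordCountJob.Arguments.Add("wasb:///example/data/WordCountOutput");        // output folder

var jobClient = JobSubmissionClientFactory.Connect(
    new JobSubmissionCertificateCredential(Constants.subscriptionId, cert, Constants.clusterName));
var jobResults = jobClient.CreateMapReduceJob(wordCountJob);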
22. Programs and Features list: If there is a problem with the installation, the first thing you should do is go to the Programs and Features page in the Control Panel and check for these two items:
• Microsoft HDInsight Emulator for Windows Azure
• Hortonworks Data Platform 1.1 Developer
Uninstall these items and repeat the installation procedure. The order of uninstall is important: you should uninstall the Windows Azure HDInsight Emulator first, and then the Hortonworks Data Platform 1.1 Developer. The best approach to troubleshooting such installation/uninstallation issues is to enable MSI logging. You can follow the instructions in the following Knowledge Base article to set up MSI logging: http://support.microsoft.com/2233000. After enabling logging, repeat the action that failed, and the log that's generated should point you in the right direction. If it turns out that uninstallation is failing due to missing setup files, you can probably try to get the missing files in place from another installation of the emulator. Just to reiterate: the HDInsight emulator is a single-node deployment, so don't be surprised when you see that the number of live nodes in your cluster is 1 after you launch the Hadoop NameNode Status portal, as shown in Figure 7-5.
[Figure 7-5. The NameNode status page for 127.0.0.1:8020 — Started: Tue Oct 29 07:27:23 PDT 2013, Version: 1.1.0-SNAPSHOT, r56179ddb38bfec1016c1ae0ae13a9f9c1…]
23. The Windows Azure HDInsight service provides everything you need to quickly deploy, manage, and use Hadoop clusters running on Windows Azure. If you have a Windows Azure subscription, you can deploy your HDInsight clusters using the Azure management portal. Creating your cluster is nothing but provisioning a set of virtual machines in the Microsoft Cloud with Apache Hadoop and its supporting projects bundled in. The HDInsight service gives you the ability to gain the full value of Big Data with a modern, cloud-based data platform that manages data of any type, whether structured or unstructured, and of any size. With the HDInsight service, you can seamlessly store and process data of all types through Microsoft's modern data platform, which provides simplicity, ease of management, and an open, enterprise-ready Hadoop service, all running in the cloud. You can analyze your Hadoop data directly in Excel using new self-service business intelligence (BI) capabilities like Data Explorer and Power View.

HDInsight Versions
You can choose your HDInsight cluster version while provisioning it using the Azure management dashboard. Currently, there are two versions available, but there will be more as updated versions of Hadoop projects are released and Hortonworks ports them to Windows through the Hortonworks Data Platform (HDP).

Cluster Version 2.1
The defau…
24. …"Your Password" -AsPlainText -Force
$myCreds = New-Object System.Management.Automation.PSCredential("admin", $secpasswd)

The sequence of operations needed to move you toward a job submission through PowerShell is pretty much the same as in the .NET client:
• Creating the job definition
• Submitting the job
• Waiting for the job to complete
• Reading and displaying the output
The following piece of PowerShell script does that in sequence:

# Define the word count MapReduce job
$mapReduceJobDefinition = New-AzureHDInsightMapReduceJobDefinition -JarFile $jarFile -ClassName $class -Arguments $inputPath, $outputPath

# Submit the MapReduce job
Select-AzureSubscription $subscription
$wordCountJob = Start-AzureHDInsightJob -Cluster $cluster -JobDefinition $mapReduceJobDefinition -Credential $myCreds

# Wait for the job to complete
Wait-AzureHDInsightJob -Job $wordCountJob -WaitTimeoutInSeconds 3600 -Credential $myCreds

# Get the job standard error output
Get-AzureHDInsightJobOutput -Cluster $cluster -JobId $wordCountJob.JobId -StandardError -Subscription $subscription

# Get the blob content
Get-AzureStorageBlobContent -Container $Container -Blob example/data/WordCountOutputPS/part-r-00000 -Context $storageContext -Force

# List the content of the output file
cat example/data/WordCountOutputPS/part-r-00000 | findstr human

Note: Because the output would be a huge number o…
25. …$storageAccountName -StorageAccountKey $storageAccountKey
$inputPath = "wasb:///example/data/gutenberg/davinci.txt"
$outputPath = "wasb:///example/data/WordCountOutput"
$jarFile = "wasb:///example/jars/hadoop-examples.jar"
$class = "wordcount"
$passwd = ConvertTo-SecureString "Your Password" -AsPlainText -Force
$myCreds = New-Object System.Management.Automation.PSCredential("admin", $secpasswd)

# Define the word count MapReduce job
$mapReduceJobDefinition = New-AzureHDInsightMapReduceJobDefinition -JarFile $jarFile -ClassName $class -Arguments $inputPath, $outputPath

# Submit the MapReduce job
Select-AzureSubscription $subscription
$wordCountJob = Start-AzureHDInsightJob -Cluster $cluster -JobDefinition $mapReduceJobDefinition -Credential $myCreds

# Wait for the job to complete
Wait-AzureHDInsightJob -Job $wordCountJob -WaitTimeoutInSeconds 3600 -Credential $myCreds

# Get the job standard error output
Get-AzureHDInsightJobOutput -Cluster $cluster -JobId $wordCountJob.JobId -StandardError -Subscription $subscription

# Get the blob content
Get-AzureStorageBlobContent -Container $Container -Blob example/data/WordCountOutputPS/part-r-00000 -Context $storageContext -Force

# List the content of the output file
cat example/data/WordCountOutputPS/part-r-00000 | findstr human

Executing the Job
You can execute the script directly from the PowerShell ISE or use the Wi…
26. store.Open(OpenFlags.ReadOnly);
var cert = store.Certificates.Cast<X509Certificate2>().First(item => item.Thumbprint == Constants.thumbprint);
var creds = new JobSubmissionCertificateCredential(Constants.subscriptionId, cert, Constants.clusterName);

Then create a job submission client object and submit the Hive job based on the definition:

var jobClient = JobSubmissionClientFactory.Connect(creds);
JobCreationResults jobResults = jobClient.CreateHiveJob(hiveJobDefinition);
Console.Write("Executing Hive Job");

// Wait for the job to complete
WaitForJobCompletion(jobResults, jobClient);

Finally, you are ready to read the blob storage and display the output:

// Print the Hive job output
System.IO.Stream stream = jobClient.GetJobOutput(jobResults.JobId);
System.IO.StreamReader reader = new System.IO.StreamReader(stream);
Console.Write("Done. List of Tables are:\n");
Console.WriteLine(reader.ReadToEnd());

Listing 5-10 shows the complete DoHiveOperations method. Note that it uses the same WaitForJobCompletion method to wait and display progress while the job execution is in progress.

Listing 5-10. DoHiveOperations method
public static void DoHiveOperations()
{
    HiveJobCreateParameters hiveJobDefinition = new HiveJobCreateParameters()
    {
        JobName = "Show tables job",
        StatusFolder = "/TableListFolder",
        Query = "show tables;"
    };
    var s…
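The WaitForJobCompletion helper referenced above is not shown in this excerpt. A minimal sketch of what such a helper typically does with this SDK is to poll the job status until it leaves the running state (the method shape below is assumed, not the book's exact listing):

private static void WaitForJobCompletion(JobCreationResults jobResults, IJobSubmissionClient jobClient)
{
    JobDetails jobInProgress = jobClient.GetJob(jobResults.JobId);
    while (jobInProgress.StatusCode != JobStatusCode.Completed &&
           jobInProgress.StatusCode != JobStatusCode.Failed)
    {
        // Poll every ten seconds, printing a dot as a progress indicator
        jobInProgress = jobClient.GetJob(jobInProgress.JobId);
        System.Threading.Thread.Sleep(TimeSpan.FromSeconds(10));
        Console.Write(".");
    }
}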
27. …the C:\apps\dist\oozie-3.3.2.1.3.1.0-06\oozie-win-distro\logs directory.

ooziejpa.log: Reports Oozie database persistence-level log messages. It is present in the C:\apps\dist\oozie-3.3.2.1.3.1.0-06\oozie-win-distro\logs directory.

oozieops.log: This file records all administrative tasks and operations messages for Oozie. It is present in the C:\apps\dist\oozie-3.3.2.1.3.1.0-06\oozie-win-distro\logs directory.

oozieinstrumentation.log: This file records Oozie instrumentation data and is refreshed every 60 seconds. It is present in the C:\apps\dist\oozie-3.3.2.1.3.1.0-06\oozie-win-distro\logs directory.

pig_<Random_Number>.log: This file logs the results of Pig job executions from the Grunt shell. It is found in the C:\apps\dist\hadoop-1.2.0.1.3.1.0-06\logs folder.

Collectively, all these different types of log files will help you figure out issues in the event of a failure during service startup, job submission, or job execution.

Windows ODBC Tracing
One of the most common ways to consume HDInsight data is through Hive and the ODBC layer it exposes. The Windows operating system has built-in capabilities to trace all the ODBC driver API calls and their return values. Often, when client applications like Excel, Integration Services, and others fail to connect to HDInsight using the ODBC driver, the driver logging mechanism comes in handy. Third-party ODBC drivers might not have bu…
28. …tools. This book specifically covers HDInsight, which is Microsoft's implementation of Hadoop on Windows. The book covers HDInsight and its tight integration with the ecosystem of other Microsoft products, like SQL Server, Excel, and various BI tools. Readers should have some understanding of those tools in order to get the most from this book.

Versions Used
It is important to understand that HDInsight is offered as an Azure service. The upgrades are pretty frequent and come in the form of Azure Service Updates. Additionally, HDInsight as a product has core dependencies on Apache Hadoop; every change in the Apache project needs to be ported as well. Thus, you should expect that version numbers of several components will be updated and changed going forward. However, the crux of Hadoop and HDInsight is not going to change much. In other words, the core of this book's content and methodologies are going to hold up well.

Structure of the Book
This book is best read sequentially from the beginning to the end. I have made an effort to provide the background of Microsoft's Big Data story, HDInsight as a technology, and the Windows Azure Storage infrastructure. This book gradually takes you through a tour of HDInsight cluster creation, job submission, and monitoring, and finally ends with some troubleshooting steps.

Chapter 1, Introducing HDInsight, starts off the book by giving you some back…
29. [Figure 3-15. The HDInsight cluster dashboard — usage at a quick glance (cluster cores vs. all clusters: 47 of 170 HDInsight cores remaining), status, location North Europe, creation date 11/15/2013 12:34:31 AM, version 2.1.0.0.275681, subscription details, and linked resources democlusterDB (SQL Database, Online) and democlusterstorage (Storage Account, Online).]
You can also click the MONITOR option to have a closer look at the currently active mappers and reducers, as shown in Figure 3-16. Again, we will come back to this screen later, while running a few MapReduce jobs on the cluster.
[Figure 3-16. Monitoring your cluster — the democluster monitor charting Active Map Tasks and Active Reduce Tasks (min/max/avg/total, all 0) over a relative one-hour window from 6:00 to 7:00.]
You can also choose to alter the filters and customize the refresh rate for the dashboard, as shown in Figure 3-17.
[Figure 3-17. Setting the dashboard refresh rate — absolute/relative time range and Y-axis options.]

Configuring the Cluster
If you want to control the…
30. …01:37:50 INFO mapred.JobClient:   Map-Reduce Framework
13/12/10 01:37:50 INFO mapred.JobClient:     Map input records=36153
13/12/10 01:37:50 INFO mapred.JobClient:     Physical memory (bytes) snapshot=779915264
13/12/10 01:37:50 INFO mapred.JobClient:     Spilled Records=0
13/12/10 01:37:50 INFO mapred.JobClient:     CPU time spent (ms)=17259
13/12/10 01:37:50 INFO mapred.JobClient:     Total committed heap usage (bytes)=2058092544
13/12/10 01:37:50 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2608484352
13/12/10 01:37:50 INFO mapred.JobClient:     Map output records=36153
13/12/10 01:37:50 INFO mapred.JobClient:     SPLIT_RAW_BYTES=792
13/12/10 01:37:50 INFO mapreduce.ExportJobBase: Transferred 792 bytes in 53.6492 seconds (14.7626 bytes/sec)
13/12/10 01:37:50 INFO mapreduce.ExportJobBase: Exported 36153 records.

As you can see, Sqoop is a pretty handy import/export tool for your cluster's data, allowing you to move easily to and from a SQL Azure database. Sqoop allows you to merge structured and unstructured data and to provide powerful analytics on the data overall. For a complete reference of all the available Sqoop commands, visit the Apache documentation site at https://cwiki.apache.org/confluence/display/SQOOP/Home.

The Pig Console
Pig is a set-based data transformation tool that works on top of the Hadoop stack to manipulate data sets, to add and remove aggregates, and to transform data. Pig is most analogous to the Dataflow task in SQL Server Integra…
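For reference, a Sqoop export of the kind that produced the log output above is typically invoked along these lines; the server, database, credentials, table, and paths are placeholders, not values from the book:

sqoop export --connect "jdbc:sqlserver://yourserver.database.windows.net;database=yourdb;user=youruser@yourserver;password=yourpassword" --table stockdata --export-dir /user/data/stockdata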
31. [Figure 6-19. Hadoop on the Windows installation directory — C:\apps\dist containing folders (all dated 12/10/2013) such as hive-0.11.0.1.3.1.0-06, isotopejs, java, log-jetwappender, oozie-3.3.2.1.3.1.0-06, pig-0.11.0.1.3.1.0-06, sqljdbc_3.0, and sqoop-1.4.3.1.3.1.0-06.]

Note: The Java runtime is also deployed in the same directory.

Summary
In this chapter, you read about enabling Remote Desktop and logging on to the HDInsight cluster's name node with proper cluster credentials. The name node is the heart of the cluster, and you can do all the operations from the name node that you can from the management portal or the .NET SDK and PowerShell scripts. The name node gives you access to the Hadoop command line and the web interfaces that are available with the distribution. HDInsight simulates WASB as HDFS behind the scenes for the end users. You saw how all the input and output files are actually saved back to your Azure storage account, dedicated to the cluster, through the Azure Management portal. The WASB mechanism is an abstraction to the user, who sees a simulation of HDFS when dealing with file system operations. You learned to execute basic HDFS MapReduce comm…
32. …hdp-1.0.1-winpkg\resources\java.zip
HDP: Setting JAVA_HOME to c:\hadoop\java at machine scope
HDP: Done Installing Java
HDP: C:\HadoopInstallFiles\HadoopPackages\hdp-1.0.1-winpkg\scripts\create_hadoop_user.ps1 -credentialFilePath c:\hadoop\singlenodecreds.xml
CREATE-USER: Logging to existing log C:\HadoopInstallFiles\HadoopSetupTools\hdp-1.0.1-winpkg.install.log
CREATE-USER: Logging to C:\HadoopInstallFiles\HadoopSetupTools\hdp-1.0.1-winpkg.install.log
CREATE-USER: HDP_INSTALL_PATH: C:\HadoopInstallFiles\HadoopPackages\hdp-1.0.1-winpkg\scripts
CREATE-USER: HDP_RESOURCES_DIR: C:\HadoopInstallFiles\HadoopPackages\hdp-1.0.1-winpkg\resources
CREATE-USER: Username not provided. Using default username: hadoop
CREATE-USER: UserGroup not provided. Using default UserGroup: HadoopUsers
CREATE-USER: Password not provided. Generating a password
CREATE-USER: Saving credentials to c:\hadoop\singlenodecreds.xml while running as FAREAST\desarkar
CREATE-USER: Creating user hadoop
CREATE-USER: User hadoop created
CREATE-USER: Granting SeCreateSymbolicLinkPrivilege
CREATE-USER: C:\HadoopInstallFiles\HadoopPackages\hdp-1.0.1-winpkg\resources\installHelper2.exe -u PUMBAA\hadoop -r SeCreateSymbolicLinkPrivilege
CREATE-USER: SeCreateSymbolicLinkPrivilege granted
CREATE-USER: Granting SeServiceLogonRight
CREATE-USER: C:\HadoopInstallFiles\HadoopPackages\hdp-1.0.1-winpkg\resources\installHelper2.exe -u PUMBAA\hadoop -r SeServiceLogonRight
33. …1.3.0.1-0302\conf directory. Listing 13-7 is a sample snippet that shows how you can specify the Hive log file path.

Listing 13-7. hive-site.xml
<property>
  <name>hive.log.dir</name>
  <value>c:\apps\dist\hive-0.11.0.1.3.0.1-0302\logs</value>
</property>

Listing 13-7 shows the default location of the log file for Hive, which is the logs folder. The log file is created with the name hive.log.

Log Files
Any Data Definition Language (DDL) or Data Manipulation Language (DML) commands are logged in the log files. For example, if you execute an HQL CREATE DATABASE TEST and it gets created successfully, you should see entries similar to Listing 13-8 in your hive.log file.

Listing 13-8. hive.log
2013-11-15 11:56:49,326 INFO ql.Driver (PerfLogger.java:PerfLogBegin(100)) - <PERFLOG method=Driver.run>
2013-11-15 11:56:49,326 INFO ql.Driver (PerfLogger.java:PerfLogBegin(100)) - <PERFLOG method=TimeToSubmit>
2013-11-15 11:56:49,326 INFO ql.Driver (PerfLogger.java:PerfLogBegin(100)) - <PERFLOG method=compile>
2013-11-15 11:56:49,327 INFO parse.ParseDriver (ParseDriver.java:parse(179)) - Parsing command: create database test
2013-11-15 11:56:49,329 INFO parse.ParseDriver (ParseDriver.java:parse(197)) - Parse Completed
2013-11-15 11:56:49,331 INFO ql.Driver (Driver.java:compile(442)) - Semantic Analysis Completed
2013-1…
34. [Back-matter index pages, with page references: entries under Hive (MetaStore definition; table creation: CLUSTERED BY clause, democlustercontainer, demo stock data, external and internal tables, LOAD commands, PARTITIONED BY and SKEWED BY clauses, querying data, schema verification, stock_analysis, StockData folder, uploaded files list, WASB; testing: advanced options dialog box, configuration, connection establishment, New Data Source wizard, System DSN tab, Windows Azure HDInsight Emulator), Hive/Oozie storage configuration, HiveQL, Hive source component (ADO.NET source, Hive table columns, preview of Hive query results, table selection), Infrastructure as a Service (IaaS), installer logs (deployment error and process, HDInsight install log, install/uninstall logs, re-imaging status entries, VM provisioning), JavaScript Object Notation (JSON), JobHistoryServer, JobTracker, ListClusters, Log4j framework, logging mechanism (error log file, log4j log files, Service Trace Logs, WASB, Windows Azure HDInsight Emulator, Windows ODBC tracing, wrapper logs), and MapReduce (attempt file, compression, concatenation file, core-site.x…).]
35. …2 out of 2. Hive is a common choice in the Hadoop world. SQL users take no time to get started with Hive, because the schema-based data structure is very familiar to them. Familiarity with SQL syntax also translates well into using Hive.

Pig Jobs
Pig is a set-based data transformation tool that works on top of Hadoop and cluster storage. Pig offers a command-line application for user input called Grunt, and the scripts are called Pig Latin. Pig can be run on the name node host or a client machine, and it can run jobs that read data from HDFS/WASB and compute data using the MapReduce framework. The biggest advantage, again, is to free the developer from writing complex MapReduce programs.

Configuration File
The configuration file for Pig is pig.properties, and it is found in the C:\apps\dist\pig-0.11.0.1.3.1.0-06\conf directory of the HDInsight name node. It contains several key parameters for controlling job submission and execution. Listing 13-15 highlights a few of them.

Listing 13-15. pig.properties file
# Verbose: print all log messages to screen (default is to print only INFO and above to screen)
verbose=true
# Exectype: local|mapreduce; mapreduce is default
exectype=mapreduce
# The following two parameters are to help estimate the reducer number
pig.exec.reducers.bytes.per.reducer=1000000000
pig.exec.reducers.max=999
# Performance tuning properties
pig.cachedbag.memusage=0.2
pig.skewedjoin.reduce.memusage=0.3
pig.exec.nocombiner…
36. …25594'.
Adding 'Newtonsoft.Json 4.5.11' to HadoopClient.
Successfully added 'Newtonsoft.Json 4.5.11' to HadoopClient.
Adding 'Microsoft.Hadoop.MapReduce 0.9.4951.25594' to HadoopClient.
Successfully added 'Microsoft.Hadoop.MapReduce 0.9.4951.25594' to HadoopClient.
Setting MRLib items CopyToOutputDirectory=true

Note: The version numbers displayed while installing the NuGet package might change with future version updates of the SDK.

Once the NuGet package has been added, add a reference to the dll file in your code:

using Microsoft.Hadoop.MapReduce;

Once these required references are added, you are ready to code your MapReduce classes and job submission logic in your application.

Submitting a Custom MapReduce Job
In the previous chapter, we already created the Constants.cs class to reuse several constant values, like your Azure cluster URL, storage account, containers, and so on. The code in the class file should look similar to Listing 5-1.

Listing 5-1. The Constants class
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace HadoopClient
{
    public class Constants
    {
        public static Uri azureClusterUri = new Uri("https://democluster.azurehdinsight.net:443");
        public static string clusterName = "democluster";
        public static string thumbprint = "Your_Certificate_Thumbprint";
        public static Guid subscriptionId = new Gui…
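Before the chapter's own listing, it may help to see the general shape that MapReduce classes take with the Microsoft.Hadoop.MapReduce package: a mapper derived from MapperBase and a reducer derived from ReducerCombinerBase. The class and variable names below are illustrative, not the book's exact wordcount listing:

using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;

public class WordCountMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        // Emit each word in the input line with a count of 1
        foreach (var word in inputLine.Split(' '))
            context.EmitKeyValue(word, "1");
    }
}

public class WordCountReducer : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        // Sum the per-word counts emitted by the mappers
        context.EmitKeyValue(key, values.Count().ToString());
    }
}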
37. …Application, 19 KB.
[Figure 5-7. The MRRunner.exe utility.]
You can launch a command prompt and run MRRunner.exe with the appropriate arguments. Specify the HadoopClient.dll from the project's bin\Debug folder, as in the following example:

E:\HadoopClient\HadoopClient\MRLib>MRRunner -dll E:\HadoopClient\HadoopClient\bin\Debug\HadoopClient.dll

Note: In case you are using a release build for your project, you will find the HadoopClient.dll file in your project's bin\Release folder. You also need to change the project output type to Class Library, from the Project > Properties menu, to generate HadoopClient.dll.

On successful completion of the job, you will see output similar to Listing 5-15.

Listing 5-15. MRRunner output
Output folder exists; deleting.
File dependencies to include with job:
Auto-detected: C:\windows\Microsoft.Net\assembly\GAC_MSIL\PresentationFramework\v4.0_4.0.0.0__31bf3856ad364e35\PresentationFramework.dll
Auto-detected: C:\windows\Microsoft.Net\assembly\GAC_32\PresentationCore\v4.0_4.0.0.0__31bf3856ad364e35\PresentationCore.dll
Auto-detected: C:\windows\Microsoft.Net\assembly\GAC_MSIL\UIAutomationProvider\v4.0_4.0.0.0__31bf3856ad364e35\UIAutomationProvider.dll
Auto-detected: C:\windows\Microsoft.Net\assembly\GAC_MSIL\UIAutomationTypes\v4.0_4.0.0.0__31bf3856ad364e35\UIAutomationTypes.dll
Auto-detected: C:\windows\Microsoft.Net\assembly\G…
38. Get-AzureHDInsightCluster -Subscription incorrectsub -debug
DEBUG: Severity: Error. One or more errors occurred.
   at Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.PSCmdlets.GetAzureHDInsightClusterCmdlet.EndProcessing()
One or more errors occurred.
Unable to resolve subscription 'incorrectsub'
   at Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.GetAzureHDInsightClusters.AzureHDInsightCommandExtensions.ResolveSubscriptionId(String subscription)
   at Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.GetAzureHDInsightClusters.AzureHDInsightCommandExtensions.GetSubscriptionCertificateCredentials(IAzureHDInsightCommonCommandBase command)
   at Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.Commands.CommandImplementations.GetAzureHDInsightClusterCommand.<EndProcessing>d__2.MoveNext()

Summary
The Windows Azure HDInsight Service writes the sequence of installations during cluster deployments to specific log files. These log files are the ones to fall back on if your cluster provisioning process encounters errors. Using a cloud service limits your control of operations compared to the control you have over your on-premises products. This chapter taught you about troubleshooting mechanisms and places to start investigating when something goes wrong. You also learned about the different debugging mechanisms available with Visual Studio and Windows Azure PowerShell when provisioning your HDInsight clusters programmatically. I…
39. …Constants.storageAccountKey);
CloudBlobClient blobClient = storageAccount.CreateCloudBlobClient();
CloudBlobContainer blobContainer = blobClient.GetContainerReference(Constants.container);
CloudBlockBlob blockBlob = blobContainer.GetBlockBlobReference("example/data/WordCountOutput/part-r-00000");
blockBlob.DownloadToStream(stream);
stream.Position = 0;
StreamReader reader = new StreamReader(stream);
Console.Write("Done. Word counts are:\n");
Console.WriteLine(reader.ReadToEnd());

Add a call to this method in Program.cs and run the program. You should see the job complete with success, and the words with their counts should be displayed in the console. Thus, the .NET Framework exposes two different ways to submit MapReduce jobs to your HDInsight clusters: you can write your own .NET MapReduce classes, or you can choose to run any of the existing ones bundled in jar files.

Submitting a Hive Job
As stated earlier, Hive is an abstraction over MapReduce that provides a SQL-like language that is internally broken down into MapReduce jobs. This relieves the programmer of writing the code and developing the MapReduce infrastructure described in the previous section.

Adding the References
Launch the NuGet Package Manager Console and import the Hive NuGet package by running the following command:

Install-Package Microsoft.Hadoop.Hive

This should import the required dll, along with any dependencies it may have. You will see…
40. …data warehouse, combine data from multiple tables.
[Figure 1-5. Data collection and analytics.]
Enterprise BI is a topic in itself, and there are several factors that require special consideration when integrating a Big Data solution such as HDInsight with an enterprise BI system. You should carefully evaluate the feasibility of integrating HDInsight and the benefits you can get out of it. The ability to combine multiple data sources in a personal data model enables you to have a more flexible approach to data exploration that goes beyond the constraints of a formally managed corporate data warehouse. Users can augment reports and analyses of data from the corporate BI solution with additional data from HDInsight to create a mash-up solution that brings data from both sources into a single consolidated report. Figure 1-6 illustrates an HDInsight deployment as a powerful BI and reporting tool to generate business intelligence for better decision making.
[Figure 1-6. Enterprise BI solution — SQL Server, HDInsight, and Windows Azure SQL Database ("cleanse, transform, and validate"), fed by external data, stream data (StreamInsight), and other formats, feeding SQL Server Reporting Services, Windows Azure SQL Reporting, and other reporting solutions.]
41. …Explorer to view the output folder in your C:\Output directory, as shown in Figure 6-8.
[Figure 6-8. The output folder in the local file system — C:\output containing the _SUCCESS and part-r-00000 files, dated 7/30/2013 10:35 AM.]
As indicated before, because Windows does not understand shell scripts for Linux (.sh files), all the command scripts and executables are implemented through Windows command files (.cmd files). You can use them directly from the command prompt as you would in Linux, thus providing a complete abstraction to end users on Windows. For example, to start or stop your cluster, you can use the commands:
• stop-master.cmd
• stop-slave.cmd
Detailed descriptions of all the core Hadoop commands are beyond the scope of this book. If you are interested, you can refer to Apache's user manual on Hadoop commands for a complete listing and description at http://hadoop.apache.org/docs/r1.0.4/commands_manual.html.
A very important thing to reiterate here is that the HDInsight Service actually simulates the HDFS behaviors for the end user. All the cluster data is actually stored in Windows Azure Storage Blob (WASB) in cluster-specific containers. If you remember, the core-site.xml file…
Hadoop on Windows

Copyright © 2014 by Debarchan Sarkar

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis, or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

ISBN-13 (pbk): 978-1-4302-6055-4
ISBN-13 (electronic): 978-1-4302-6056-1

Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image, we use the names, logos, and images only in an editorial fashion and to the benefit
INFO ql.Driver (Driver.java:getSchema(259)) - Returning Hive schema: Schema(fieldSchemas:null, properties:null)
2013-11-15 13:37:11,436 INFO ql.Driver (PerfLogger.java:PerfLogEnd(127)) - </PERFLOG method=compile start=1384522631433 end=1384522631436 duration=3>
2013-11-15 13:37:11,437 INFO ql.Driver (PerfLogger.java:PerfLogBegin(100)) - <PERFLOG method=Driver.execute>
2013-11-15 13:37:11,437 INFO ql.Driver (Driver.java:execute(1066)) - Starting command: create database test
2013-11-15 13:37:11,437 INFO ql.Driver (PerfLogger.java:PerfLogEnd(127)) - </PERFLOG method=TimeToSubmit start=1384522631433 end=1384522631437 duration=4>
2013-11-15 13:37:11,508 ERROR exec.Task (SessionState.java:printError(432)) - Database test already exists
2013-11-15 13:37:11,509 ERROR ql.Driver (SessionState.java:printError(432)) - FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
2013-11-15 13:37:11,510 INFO ql.Driver (PerfLogger.java:PerfLogEnd(127)) - </PERFLOG method=Driver.execute start=1384522631437 end=1384522631510 duration=73>
2013-11-15 13:37:11,511 INFO ql.Driver (PerfLogger.java:PerfLogBegin(100)) - <PERFLOG method=releaseLocks>
2013-11-15 13:37:11,512 INFO ql.Driver (PerfLogger.java:PerfLogEnd(127)) - </PERFLOG method=releaseLocks start=1384522631511 end=1384522631512 duration=1>
2013-11-15 13:37:11,512 INFO ql.Driver (PerfLogger.java:PerfLogBegin(100)) - <PERFLOG metho
If you are running a multinode Hadoop cluster, the logs you will find here are not centrally aggregated. To put together a complete picture, you will need to check and verify each Task Node's logs (its userlogs directory) for their output, and then assemble the full log history to understand what went wrong in a particular job. In a Hadoop cluster, the entire job submission, execution, and history-management process is handled by three types of services:

	JobTracker: The JobTracker is the master of the system, and it manages the jobs and resources in the cluster. The JobTracker schedules and coordinates with each of the TaskTrackers that are launched to complete the jobs.

	TaskTrackers: These are the slave services deployed on Data Nodes or Task Nodes. They are responsible for running the map and reduce tasks as instructed by the JobTracker.

	JobHistoryServer: This is a service that serves historical information about completed jobs. The JobHistoryServer can be embedded within the JobTracker process, but if you have an extremely busy cluster, it is recommended that you run it as a separate service. This can be done by setting the mapreduce.history.server.embedded property to false in the mapred-site.xml file (a sketch of this setting follows the note below). Running this service consumes considerable disk space, because it saves job-history information for all the jobs.

Note: In Hadoop versions 2.0 and beyond, MapReduce will be replaced by YARN, or MapReduce 2.0 (MRv2).
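For reference, the change amounts to the property block below in mapred-site.xml. This is a sketch of the standalone-history-server setting described above; confirm the property name against your cluster's Hadoop version before relying on it.

<property>
  <name>mapreduce.history.server.embedded</name>
  <!-- false runs the JobHistoryServer as its own service instead of inside the JobTracker -->
  <value>false</value>
</property>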
TableFacebook.csv"
$blobName = "TableFacebook.csv"

# Get the storage account key
Select-AzureSubscription -SubscriptionName $subscriptionName
$storageaccountkey = (Get-AzureStorageKey $storageAccountName).Primary

# Create the storage context object
$destContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageaccountkey

# Copy the file from the local workstation to the Blob container
Set-AzureStorageBlobContent -File $fileName -Container $containerName -Blob $blobName -Context $destContext

Note: Repeat these steps with the other .csv files in the folder by changing the $fileName and $blobName variables and rerunning Set-AzureStorageBlobContent.

Once the files are uploaded, log on to the name node using Remote Desktop and execute the command hadoop fs -ls in the Hadoop Command Line. This should list all the files you just uploaded, as shown in Listing 8-2.

Listing 8-2. Listing the uploaded files

c:\apps\dist\hadoop-1.2.0.1.3.1.0-06>hadoop fs -ls
Found 10 items
-rwxrwxrwx   1      15967 2013-11-24 06:43 TableFacebook.csv
-rwxrwxrwx   1     130005 2013-11-24 06:42 TableGoogle.csv
-rwxrwxrwx   1     683433 2013-11-24 06:42 TableIBM.csv
-rwxrwxrwx   1     370361 2013-11-24 06:43 TableMSFT.csv
-rwxrwxrwx   1     341292 2013-11-24 06:42 TableOracle.csv
-rwxrwxrwx   1     341292 2013-11-24 06:43 TableApple.csv

You can also use the Azure portal to navigate to the storage
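Rather than editing the two variables by hand for each file, the repetition called out in the note can be scripted. The loop below is a small sketch that assumes all the .csv files live in one local folder (C:\csv is a placeholder) and reuses the $containerName and $destContext values created above.

# Upload every .csv in the folder to the same container
Get-ChildItem "C:\csv" -Filter *.csv | ForEach-Object {
    Set-AzureStorageBlobContent -File $_.FullName `
        -Container $containerName `
        -Blob $_.Name `
        -Context $destContext
}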
Figure 9-4. Selecting a provider

The next screen in the wizard accepts the connection string for the data source. You can choose to build the connection string instead of writing it manually, so click on the Build button to bring up the Data Link dialog, where you can select the HadoopOnAzure DSN we created earlier and provide the correct credentials to access the HDInsight cluster. Make sure to select Allow saving password so that the password is retained in the underlying PowerPivot Table Import Wizard. Also, verify that Test Connection succeeds, as shown in Figure 9-5. If you provide all the details correctly, you should also be able to enumerate the default database (HIVE) in the "Enter the initial catalog to use" drop-down list.

Figure 9-5. Configuring the connection string (the Table Import Wizard captures a friendly name for the connection, HiveConnection; the logon credentials, user name admin with Allow saving password checked; and the initial catalog)

The Table Import Wizard dialog should be populated with
a great way to isolate any external issues when you are facing errors while submitting jobs using PowerShell or .NET.

Hadoop Web Interfaces
Core Hadoop provides a couple of web interfaces to monitor your cluster, and by default they are available on the desktop of the name node. These portals can provide useful details about cluster health, usage, and MapReduce job-execution statistics. The shortcuts to these portals are created on the desktop during the Azure virtual machine (VM) provisioning process, as shown in Figure 6-10. They are:

	Hadoop MapReduce Status
	Hadoop Name Node Status

Figure 6-10. Shortcuts to the web portals

Hadoop MapReduce Status
The Hadoop MapReduce portal displays information on job configuration parameters and execution statistics in terms of running, completed, and failed jobs. The portal also shows job-history log files. You can drill down on individual jobs and examine the details. The portal is referred to as the JobTracker portal, because each MapReduce operation is submitted and executed as a job in the cluster. The tracker portion of the portal is basically a Java-based web application that listens on port 50030. The URL for the portal is:

http://<NameNode_IP_Address>:50030/jobtracker.jsp

Figure 6-11 shows the MapReduce status (or JobTracker status) portal when it is launched.
and deploy via simplified packaging and configuration. These improvements will enable IT to apply consistent security policies across Hadoop clusters and manage them from a single pane of glass in System Center 2012. Further, Microsoft SQL Server and its powerful BI suite can be leveraged to apply analytics and generate interactive business-intelligence reports, all under the same roof. For the Hadoop-based service on Windows Azure, Microsoft has further lowered the barrier to deployment by enabling the seamless setup and configuration of Hadoop clusters through an easy-to-use, web-based portal and by offering Infrastructure as a Service (IaaS). Microsoft is currently the only company offering scalable Big Data solutions in the cloud and for on-premises use. These solutions are all built on a common Microsoft Data Platform with familiar and powerful BI tools. HDInsight is available in two flavors that will be covered in subsequent chapters of this book:

	Windows Azure HDInsight Service: This is a service available to Windows Azure subscribers that uses Windows Azure clusters and integrates with Windows Azure storage. An Open Database Connectivity (ODBC) driver is available to connect the output from HDInsight queries to data-analysis tools.

	Windows Azure HDInsight Emulator: This is a single-node, single-box product that you can install on Windows Server 2012 or in your Hyper-V virtual machines. The purpose of the emulator is to provide a development environment
business intelligence (BI) tools and analytics systems. Therefore, Big Data queries are typically batch operations that, depending on the data volume and query complexity, might take considerable time to return a final result. However, when you consider the volumes of data that Big Data solutions can handle, which are well beyond the capabilities of traditional data storage systems, the fact that queries run as multiple tasks on distributed servers does offer a level of performance that cannot be achieved by other methods. Unlike most SQL queries used with relational databases, Big Data queries are typically not executed repeatedly as part of an application's execution, so batch operation is not a major disadvantage.

Is Big Data the Right Solution for You?
There is a lot of debate currently about relational vs. nonrelational technologies. "Should I use relational or nonrelational technologies for my application requirements?" is the wrong question. Both technologies are storage mechanisms designed to meet very different needs. Big Data is not here to replace any of the existing relational-model-based data storage or mining engines; rather, it will be complementary to these traditional systems, enabling people to combine the power of the two and take data analytics to new heights. The first question to be asked here is, "Do I even need Big Data?" Social media analytics have produced great insights about what consumers think about your product
by Windows Azure Storage. Responses with an HTTP status code of 500 or 503 indicate that a request has been throttled. One way to collect Windows Azure Storage responses is to turn on storage logging, as described at http://www.windowsazure.com/en-us/manage/services/storage/how-to-monitor-a-storage-account/#configurelogging. This is also discussed earlier in this book, in Chapter 11. To avoid throttling, you can adjust parameters in the WASB driver's self-throttling mechanism. The WASB driver is the HDInsight component that reads data from, and writes data to, WASB. The driver has a self-throttling mechanism that can slow individual virtual machine (VM) transfer rates between a cluster and WASB. This effectively slows the overall transfer rate between a cluster and WASB. The rate at which the self-throttling mechanism slows the transfer rate can be adjusted to keep transfer rates below throttling thresholds. By default, the self-throttling mechanism is exercised for clusters with a number of nodes n greater than 7, and it increasingly slows transfer rates as n increases. The default rate at which self-throttling is imposed is set at cluster-creation time, based on the cluster size, but it is configurable after cluster creation. The self-throttling algorithm works by delaying a request to WASB in proportion to the end-to-end latency of the previous request. The exact proportion is determined by the following parameters, configurable in core-site.xml or at job submission
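The two factors most commonly tuned are sketched below. This fragment is based on the WASB driver's fs.azure.selfthrottling.* settings as documented for HDInsight; values near 1.0 throttle less, smaller values throttle more aggressively, and the exact defaults for your cluster size may differ, so verify the names in your own core-site.xml.

<property>
  <name>fs.azure.selfthrottling.read.factor</name>
  <!-- 1.0 effectively disables read self-throttling -->
  <value>1.0</value>
</property>
<property>
  <name>fs.azure.selfthrottling.write.factor</name>
  <!-- 1.0 effectively disables write self-throttling -->
  <value>1.0</value>
</property>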
Figure 12-3. Using breakpoints in Visual Studio (execution is paused inside the cluster-creation code of HadoopClient.Program; the Watch window shows the clusterDetails object with properties such as Name = AutomatedHDICluster, Location = West US, DefaultStorageContainer = AutomatedHDIContainer, ClusterSizeInNodes = 2, and the default storage account name and key)

Breakpoints are one of the most convenient ways to debug a program from Visual Studio. To learn more about setting, removing, and manipulating breakpoints,
connection manager.

Click the New button in the Configure ADO.NET Connection Manager window to create the new connection. From the list of providers, select .Net Providers > ODBC Data Provider, and click OK in the Connection Manager window, as shown in Figure 10-7.

Figure 10-7. Choosing the .NET ODBC Data Provider (the provider list also includes the SqlClient, OracleClient, MySQL, and OLE DB data providers)

Select the HadoopOnAzure DSN from the User DSN or System DSN list, depending upon the type of DSN you created in Chapter 8. Provide the HDInsight cluster credentials, and Test Connection should succeed, as shown in Figure 10-8.

Figure 10-8. Testing the connection to Hive (the connection uses the data source name HadoopOnAzure with user name admin; the equivalent connection string is Dsn=HadoopOnAzure;uid=admin)

Creating the Destination SQL Connection
during your job submissions, because all your input files are in WASB, and all the output files written by Hadoop are also in your cluster's dedicated WASB container.

WASB Authentication
One of the most common errors encountered during cluster operations is the following:

org.apache.hadoop.fs.azure.AzureException: Unable to access container <container> in account <storage_account> using anonymous credentials, and no credentials found for them in the configuration.

This message essentially means that the WASB code couldn't find the key for the storage account in the configuration. Typically, the problem is one of two things:

	The key is not present in core-site.xml, or it is there but not in the correct format. This is usually easy to check, assuming you can use Remote Desktop to connect to your cluster. Take a look in the cluster, in C:\apps\dist\hadoop-1.2.0.1.3.1.0-06\conf\core-site.xml, for the configuration name-value pair with the name being fs.azure.account.key.<account>.

	The key is there in core-site.xml, but the process running into this exception is not reading core-site.xml. Most Hadoop components (MapReduce, Hive, and so on) read core-site.xml from that location for their configuration, but some don't. For example, Oozie has its own copy of core-site.xml that it uses. This is harder to chase down, but if you're using a nonstandard Hadoop component, this might be the culprit. You should confirm your st
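For reference, a well-formed key entry in core-site.xml looks like the sketch below. The account name mystorageaccount and the key value are placeholders for your own storage account's values, and note that some driver versions expect the fully qualified form fs.azure.account.key.mystorageaccount.blob.core.windows.net instead, so match whichever pattern your cluster already uses.

<property>
  <name>fs.azure.account.key.mystorageaccount</name>
  <!-- placeholder: paste the Base64 storage account key from the Azure portal here -->
  <value>BASE64-ENCODED-STORAGE-ACCOUNT-KEY==</value>
</property>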
error log file for each service. These record the log messages for the running Java services. If any errors are encountered while the service is already running, the stack trace of the error is logged in these files. The error logs have the extension .err.log, and they again reside in the same directory as the output and wrapper files. For example, if you have permission issues in accessing the required files and folders, you may see an error message similar to the following in your namenode .err.log file:

13/08/16 19:07:16 WARN impl.MetricsSystemImpl: Source name ugi already exists!
13/08/16 19:07:16 INFO util.GSet: VM type = 64-bit
13/08/16 19:07:16 INFO util.GSet: 2% max memory = 72.81875 MB
13/08/16 19:07:16 INFO util.GSet: capacity = 2^23 = 8388608 entries
13/08/16 19:07:16 INFO util.GSet: recommended=8388608, actual=8388608
13/08/16 19:07:16 INFO namenode.FSNamesystem: fsOwner=admin
13/08/16 19:07:16 INFO namenode.FSNamesystem: supergroup=supergroup
13/08/16 19:07:16 INFO namenode.FSNamesystem: isPermissionEnabled=false
13/08/16 19:07:16 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
13/08/16 19:07:16 ERROR namenode.FSNamesystem: FSNamesystem initialization failed.
java.io.FileNotFoundException: c:\hdfs\nn\current\VERSION (Access is denied)
	at java.io.RandomAccessFile.open(Native Method)
	at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
	at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.read(S
false
opt.multiquery=true
pig.tmpfilecompression=false

These properties help you control the number of mappers and reducers, along with several other performance-tuning options dealing with the internal dataset joins and memory usage.

Tip: A very important debugging trick is to use the exectype parameter in Pig (a short example follows at the end of this section). By default, it is set to exectype=mapreduce, which means you need access to your cluster and its storage to run your scripts. You can set this to exectype=local for debugging. To run scripts in local mode, no Hadoop or HDFS installation is required; all files are installed and run from your local host and file system. It is also possible to run Pig in Debug mode, which prints out additional messages in the console during job execution. Debug mode also provides higher logging levels that can help with isolating a given problem. The following command starts the Pig console in Debug mode:

c:\apps\dist\pig-0.11.0.1.3.1.0-06\bin>pig.cmd -Ddebug=DEBUG

For every Pig job, there is a job configuration file that gets generated. The file is located in the C:\apps\dist\hadoop-1.2.0.1.3.1.0-06\logs directory and is named job_<jobId>_conf.xml.

Log Files
Pig does not have a log file directory of its own. Rather, it logs its operations in the C:\apps\dist\hadoop-1.2.0.1.3.1.0-06\logs folder. The name of the log file is pig_<random_number>.log. This file records a Pig Stack Trace for ever
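As a concrete illustration of the tip above, the invocations below show both execution types using Pig's standard -x shorthand for exectype; myscript.pig is a placeholder name for your own script.

	REM Run a script against the cluster (the default)
	pig.cmd -x mapreduce myscript.pig

	REM Run the same script entirely on the local file system for debugging
	pig.cmd -x local myscript.pig

Because local mode touches neither the cluster nor WASB, it is a quick way to tell whether a failure lies in the script's logic or in the cluster environment.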
folder.

Figure 11-5. The userlogs folder (each job gets its own subfolder, such as job_201312100246_0022 through job_201312100246_0027, stamped with its creation date and time)

There are a few other log files that use the log4j framework. These log other cluster operations, specifically job executions, and they are classified based on their respective project. For example:

	hadoop.log: This file records only the MapReduce job-execution output. Since it's the data nodes that actually carry out the individual Map and Reduce tasks, this file is normally populated on the data nodes. It is found in the C:\apps\dist\hadoop-1.2.0.1.3.1.0-06\logs directory.

	templeton.log: This file logs the execution statistics of the jobs that are submitted using the Hadoop streaming interface. Job submissions using the .NET SDK and PowerShell fall into this category. The log is available in the C:\apps\dist\hcatalog-0.11.0.1.3.1.0-06\logs folder.

	hive.log: Found in the C:\apps\dist\hive-0.11.0.1.3.1.0-06\logs folder, this file records the output of all Hive job submissions. It is useful when a Hive job submission fails before even reaching the MapReduce phase.

	oozie.log: Oozie web services streaming operations are logged to this file. It is present in
however, historical versions are preserved. The old file names are appended with a _<timestamp> value each time they are purged and rolled over. The most current log files are in the format hadoop-namenode-<Hostname>.log, hadoop-datanode-<Hostname>.log, hadoop-secondarynamenode-<Hostname>.log, and so on, where <Hostname> is the host the service is running on. These are pretty similar to the service error log files discussed in the previous section and record the stack traces of the service failures. A typical name node log looks similar to the following snippet after a successful startup:

2013-08-16 21:32:39,324 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = <HostName>/<IP Address>
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 1.2.0
STARTUP_MSG:   build = git://github.com/hortonworks/hadoop-monarch.git on branch (no branch) -r 99a88d4851ce171cf57a621910bb293950e6358; compiled by 'jenkins' on Fri Jul 19 22:07:17 Coordinated Universal Time 2013
************************************************************/
2013-08-16 21:32:40,167 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists!
2013-08-16 21:32:40,199 INFO org.apache.hadoop.hdfs.util.GSet: VM type = 64-bit
2013-08-16 21:32:40,
its supporting projects, with the necessary Windows services, on the name node and data nodes. For example, if a node re-imaging has taken place, there will be re-imaging status entries at the very beginning of the DeploymentAgent log file, as shown in Listing 12-1.

Listing 12-1. Node re-imaging status entries

11/10/2013 12:45:20 PM +00:00  11/10/2013 12:44:39 PM +00:00  XXXXXXXXXXXXXXXXXXXXXXXXXX  IsotopeWorkerNode  IsotopeWorkerNode_IN_1  4  XXXXXXXXXXXXXXXXXX  2224  1020  SetupLogEntryEvent  1002  Info  (null)  MdsLogger  ClusterSetup  Microsoft Hadoop Deployment Engine  Azure reimaging state: 3 (REIMAGE_DATA_LOSS). Services do not exist. Java directory does not exist. Fresh installation.  1.0.0.0  XXXXXXXXXXXXXXXX  XXXXXXXXXXXXXX  False  (null) (null) (null) (null) (null)  2013-11-10 12:44:39.480  Diagnostics  0000000000000000000  0000000055834815115  11/10/2013 12:44:00 PM +00:00

If there are any errors while deploying the Apache components or the services, due to some race condition while accessing the file system, you may see log entries in this file similar to Listing 12-2.

Listing 12-2. Error during deployment

Diagnostics Information: 1001 : OrigTS: 2013-07-20 21:47:21.109; EventId: 1002; TraceLevel: Info; DeploymentEnvironment; ClassName: MdsLogger; Component: ClusterSetup; ComponentAction: Microsoft Hadoop Deployment Engine; Details: Manually taking the backup of all files as the dir
love and encouragement have been the fuel that enabled me to do the impossible. You've been the bones of my spine, keeping me straight and true. You're my blood, making sure it runs rich and strong. You're the beating of my heart. I cannot imagine a life without you. Love you so much, MA.

Contents

About the Author
About the Technical Reviewers
Acknowledgments
Introduction

Chapter 1: Introducing HDInsight
	What Is Big Data, and Why HDInsight?
	How Is Big Data Used?
	Is Big Data the Right Solution for You?
	The Apache Hadoop Ecosystem
	Microsoft HDInsight: Hadoop on Windows
	Combining HDInsight with Your Business Processes
	Summary

Chapter 2: Understanding Windows Azure HDInsight Service
	Microsoft's Cloud Computing Platform
</name>
  <value>8</value>
</property>
<property>
  <name>mapred.reduce.max.attempts</name>
  <value>8</value>
</property>
<property>
  <name>mapred.task.timeout</name>
  <value>600000</value>
</property>
<property>
  <name>mapred.max.split.size</name>
  <value>536870912</value>
</property>

If you have active Hadoop clusters, there are numerous scenarios in which you have to come back and check the properties in Listing 13-4. Most of these properties come into the picture when jobs take an unusually long time to complete and there are optimization or tuning requirements. For several other types of obvious errors that may occur during a job submission, the log files can be a source of a great deal of information.

Log Files
I covered the different types of logs generated by Hadoop and the HDInsight service in detail in Chapter 11. However, let's go quickly through the logging infrastructure for MapReduce jobs again. The log files are normally stored in the C:\apps\dist\hadoop-1.2.0.1.3.1.0-06\logs and C:\apps\dist\hadoop-1.2.0.1.3.1.0-06\bin folders by default. The jobtracker trace log file resides in the bin directory, and it logs the job startup command and the process ID. A sample trace would be similar to Listing 13-5.

Listing 13-5. jobtracker trace log

HadoopServiceTraceSource Information: 0 : T
messages similar to the following one in the logs:

2013-08-16 21:32:43,152 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hdp cause:java.io.IOException: File /mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1

Note: Each message in the logs is marked with a level, like INFO, ERROR, and so on. This level of verbosity in the error logs can be controlled using the log4j framework.

Figure 11-3 shows a screenshot of the Hadoop log files for my democluster (the C:\apps\dist\hadoop-1.2.0.1.3.1.0-06\logs folder contains, among others, the history folder, hadoop.log, and the hadoop-jobtracker-RD00155D67172B.log, hadoop-namenode-RD00155D67172B.log, and hadoop-secondarynamenode-RD00155D67172B.log files, along with per-command logs such as hadoop-copyFromLocal, hadoop-mkdir, hadoop-put, hadoop-rmr, and hadoop-stat).
nodes, and decommissioning nodes. You can also navigate through the HDFS and load chunks of data from job output files for display.

Figure 6-5. The Name Node desktop (the Remote Desktop session to democluster.azurehdinsight.net shows the Recycle Bin and the Hadoop Command Line shortcut)

You'll use the command line a lot, so let's look at that next in the sections to follow.

Hadoop Command Line
Traditional Linux-based Apache Hadoop uses shell scripting to implement the commands. Essentially, most of the commands are .sh files that need to be invoked from the command prompt. Hadoop on Windows relies on command files (.cmd) and PowerShell scripts (.ps1) to simulate the command-line shell. HDInsight has unique capabilities to talk to WASB; hence, you can operate natively with your Azure storage account containers in the cloud. To access the Hadoop command prompt, double-click the shortcut Hadoop Command Line on your name node's desktop. See Figure 6-6.

Figure 6-6. The Hadoop Command Line (the window lists the built-in commands, such as fs to run a generic filesystem user client, balancer to run a cluster-balancing utility, snapshotDiff to diff two snapshots of a directory or diff the current directory contents with a snapshot, lsSnapshottableDir to list all snapshottable dirs owned by the current user, oiv to apply the offline fsimage viewer to an fsimage, and fetch
of the entire data. So, there will be three times (3x) the required I/O during the mapping phase, a phenomenon known as data I/O explosion. The goal is to spill only once (1x) during the mapping phase, a goal that can be achieved only if you carefully select the correct configuration for your Hadoop MapReduce job. The memory buffer per data record consists of three parts. The first part is the offset of the data record, stored as a tuple. That tuple requires 12 bytes per record, and it contains the partition key, the key offset, and a value offset. The second part is the indirect sort index, requiring 4 bytes per record. Together, these two parts constitute the metadata for a record, for a total of 16 bytes per record. The third part is the record itself, which is the serialized key-value pair requiring R bytes, where R is the number of bytes of data. If each mapper handles N records, the recommended value of the parameter that sets the proper configuration in mapred-site.xml is expressed as follows:

<property>
  <name>io.sort.mb</name>
  <value>N * (16 + R) / (1024 * 1024)</value>
</property>

By specifying your configuration in this way, you reduce the chance of unwanted spill operations.

Hive Jobs
The best place to start looking at a Hive command failure is the Hive log file, which can be configured by editing the hive-site.xml file. The location of the hive-site.xml file is the C:\apps\dist\hive-0.11.0
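Returning to the io.sort.mb formula above, here is a small worked example with assumed numbers: suppose a mapper processes N = 1,000,000 records and each serialized key-value pair averages R = 84 bytes.

	buffer = N * (16 + R) bytes
	       = 1,000,000 * (16 + 84)
	       = 100,000,000 bytes
	       = 100,000,000 / (1024 * 1024) MB
	       ≈ 95.4 MB

Rounding up, setting io.sort.mb to 96 (or 100, to leave some headroom) would let such a map task buffer all of its records in memory and spill only once.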
of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

President and Publisher: Paul Manning
Lead Editor: Jonathan Gennick
Technical Reviewers: Scott Klein, Rodney Landrum
Editorial Board: Steve Anglin, Mark Beckner, Ewan Buckingham, Gary Cornell, Louise Corrigan, James T. DeWolf, Jonathan Gennick, Jonathan Hassell, Robert Hutchinson, Michelle Lowman, James Markham, Matthew Moodie, Jeff Olson, Jeffrey Pepper, Douglas Pundick, Ben Renow-Clarke, Dominic Shakeshaft, Gwenan Spearing, Matt Wade, Steve Weiss
Coordinating Editor: Anamika Panchoo
Copy Editor: Roger LeBlanc
Compositor: SPi Global
Indexer: SPi Global
Artist: SPi Global
Cover Designer: Anna Ishchenko

Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders
out model (distributed computing) rather than the scale-up model (increasing computing and hardware resources for a single server) targeted by traditional RDBMSs like SQL Server. With hardware and storage costs falling drastically, distributed computing is rapidly becoming the preferred choice for the IT industry, which uses massive numbers of commodity systems to perform the workload. However, to decide what type of implementation you need, you must evaluate several factors related to the three Vs mentioned earlier:

	Do you want to integrate diverse, heterogeneous sources (Variety)? If your answer to this is yes, is your data predominantly semistructured or unstructured, nonrelational data? Big Data could be an optimum solution for textual discovery, categorization, and predictive analysis.

	What are the quantitative and qualitative analyses of the data (Volume)? Is there a huge volume of data to be referenced? Is data emitted in streams or in batches? Big Data solutions are ideal for scenarios where massive amounts of data need to be either streamed or batch processed.

	What is the speed at which the data arrives (Velocity)? Do you need to process data that is emitted at an extremely fast rate? Examples here include data from devices like radio frequency identification devices (RFID) transmitting digital data every microsecond, or other such scenarios. Traditionally, Big Data solutions are batch-processing or stream-processing systems best suited for such
prerequisite. In the rest of this chapter, we will focus on how to consume Hive data from SSIS using the Hive Open Database Connectivity (ODBC) driver.

The prerequisites for developing the package shown in this chapter are SQL Server Data Tools, which comes as a part of SQL Server 2012 Client Tools and Components, and the 32-bit Hive ODBC Driver. You will also need either an on-premises HDInsight Emulator or a subscription for the Windows Azure HDInsight Service with Hive running on it. These details were discussed previously, in Chapters 2 and 3.

Creating the Project
SQL Server Data Tools (SSDT) is the integrated development environment available from Microsoft to design, develop, and deploy SSIS packages. SSDT is installed when you choose to install SQL Server Client Tools and Workstation Components from your SQL Server installation media. SSDT supports the creation of Integration Services, Analysis Services, and Reporting Services projects. Here, the focus is on the Integration Services project type. To begin designing the package, load SQL Server Data Tools from the SQL Server 2012 program folder, as in Figure 10-1.

Figure 10-1. Launching SQL Server Data Tools (the Microsoft SQL Server 2012 program group also contains Import and Export Data (32-bit and 64-bit), SQL Server Management Studio, and the Analysis Services and configuration tools)
retention policy for Read Requests, Write Requests, and Delete Requests (specify the retention in days, or 0 if you do not want to set a retention policy).

Figure 11-7. Selecting the monitoring and logging level

Note that as you turn on verbose monitoring and logging, the Azure management portal warns you about the additional cost factor through visual clues and tool tips, as shown in Figure 11-8. Warning messages have special icons, as well as a brightly colored background to the text.

Figure 11-8. Pricing impact of logging and monitoring WASB (the logging section warns: "This change can have a pricing impact. Refer to storage logging help.")

Additionally, Windows Azure's logging infrastructure provides a trace of the executed requests against your storage account's blobs, tables, and queues. You can monitor requests made to your storage accounts, check the performance of individual requests, analyze the usage of specific containers and blobs, and debug storage APIs at a request level. To understand this logging infrastructure in depth, and to learn how to manage the storage analytics in detail, refer to the following blog post by the Azure Storage team:

http://blogs.msdn.com/b/windowsazurestorage/archive/2011/08/03/windows-azure-storage-logging-using-logs-to-track-storage-requests.aspx
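The same logging settings can also be scripted instead of clicked through in the portal. The sketch below uses the classic Azure PowerShell storage cmdlets; cmdlet availability depends on your module version, and the account name is a placeholder, so verify the names with Get-Help before relying on them.

# Point the cmdlets at the storage account behind the cluster
$ctx = New-AzureStorageContext -StorageAccountName "mystorageaccount" `
    -StorageAccountKey $storageAccountKey

# Turn on blob logging for reads, writes, and deletes, keeping logs for 7 days
Set-AzureStorageServiceLoggingProperty -ServiceType Blob `
    -LoggingOperations Read,Write,Delete `
    -RetentionDays 7 -Context $ctx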
the credentials to connect to the name node. Provide the user name and password you just created while enabling Remote Desktop for your cluster, as shown in Figure 6-4.

Figure 6-4. Logging on to the name node

Once valid credentials are provided, you are presented with the desktop of your cluster's name node. The HDInsight distribution creates three shortcuts for you, and you will see them on the name node's desktop, as shown in Figure 6-5. The shortcuts are:

	Hadoop Command Line: Invokes the command line, which is the traditional Windows command prompt launched from the c:\apps\dist\hadoop-1.2.0.1.3.1.0-06 directory. This is the base for command-line executions of the Hadoop commands, as well as for commands relating to Hive, Pig, Sqoop, and several other supporting projects.

	Hadoop MapReduce Status: This is a Java-based web application that comes with the Apache Hadoop distribution. The MapReduce status portal displays the MapReduce configurations based on the config file mapred-site.xml. It also shows a history of all the map and reduce task executions in the cluster, based on the job ID. You can drill down to individual jobs and their tasks to examine a MapReduce job execution.

	Hadoop Name Node Status: This is also a Java-based web portal prebuilt in Apache Hadoop. The NameNode status portal displays the file-system health, as well as the cluster health, in terms of the number of live nodes, dead
the data nodes. In the Windows Azure HDInsight service, the storage is separated from the cluster itself; by default, the Hadoop file system is pointed to Azure blob storage rather than to the traditional HDFS of the HDInsight distribution. If you recall, we discussed the advantages of using Windows Azure Storage Blob (WASB) earlier, in Chapter 2. This reduces the cluster's dependency on the name node to some extent; still, the HDInsight name node continues to be an integral part of your cluster. You can start a Remote Desktop session to log on to the name node and get access to the traditional Apache Hadoop web portals and dashboards. This also gives you access to the Hadoop command prompt and the various service logs, and it is the old-fashioned way to administer your cluster. It continues to be a favorite for a lot of users who still prefer the command-prompt way of doing things in today's world of rich and intuitive user interfaces for almost everything. I often find myself in this category, too, because I believe command-line interfaces are the bare minimum, and they give you the raw power of your modules by getting rid of any abstractions in between. It is also a good practice to operate your cluster using the command shell to test and benchmark performance, because it does not have any additional overhead. This chapter focuses on some of the basic command-line utilities you can use to operate your Hadoop cluster, and on the unique features that are implemented in the HDInsight
the output data to the file system. In your HadoopClient solution, add three classes: SquareRootMapper, SquareRootReducer, and SquareRootJob, as shown in Figure 5-2.

Figure 5-2. Mapper, Reducer, and Job classes (the HadoopClient project contains Constants.cs, Program.cs, SquareRootJob.cs, SquareRootMapper.cs, and SquareRootReducer.cs)

You need to inherit your mapper class from the .NET Framework base class MapperBase and override its Map method. Listing 5-2 shows the code for the mapper class.

Listing 5-2. SquareRootMapper.cs

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.Hadoop.MapReduce;

namespace HadoopClient
{
    class SquareRootMapper : MapperBase
    {
        public override void Map(string inputLine, MapperContext context)
        {
            int input = int.Parse(inputLine);

            // Find the square root
            double root = Math.Sqrt((double)input);

            // Write output
            context.EmitKeyValue(input.ToString(), root.ToString());
        }
    }
}

The Map function alone is enough for a simple calculation like determining square roots, so your Reducer class would not have any processing code or logic in this case. You can choose to omit it, because Reduce and Combine are optional operations in a MapReduce job.
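To round out the picture, the SquareRootJob class referenced above would tie the mapper to its input and output locations. The following is a minimal sketch based on the HadoopJob<TMapper> pattern from the Microsoft .NET SDK for Hadoop; the input and output paths are placeholders, and you should check the property names against the SDK version you install.

using Microsoft.Hadoop.MapReduce;

namespace HadoopClient
{
    class SquareRootJob : HadoopJob<SquareRootMapper>
    {
        public override HadoopJobConfiguration Configure(ExecutorContext context)
        {
            var config = new HadoopJobConfiguration();

            // Placeholder paths; point these at your own input file and output folder
            config.InputPath = "input/SqrtJob";
            config.OutputFolder = "output/SqrtJob";
            return config;
        }
    }
}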
unstructured data.

Combining HDInsight with Your Business Processes
Big Data solutions open up new opportunities for turning data into meaningful information. They can also be used to extend existing information systems to provide additional insights through analytics and data visualization. Every organization is different, so there is no definitive list of ways you can use HDInsight as part of your own business processes. However, there are four general architectural models. Understanding these will help you start making decisions about how best to integrate HDInsight with your organization, as well as with your existing BI systems and tools. The four different models are:

	A data collection, analysis, and visualization tool: This model is typically chosen for handling data you cannot process using existing systems. For example, you might want to analyze sentiment about your products or services from micro-blogging sites like Twitter, social media like Facebook, feedback from customers through email, web pages, and so forth. You might be able to combine this information with other data, such as demographic data that indicates population density and other characteristics in each city where your products are sold.

	A data transfer, data cleansing, and ETL mechanism: HDInsight can be used to extract and transform data before you load it into your existing databases
variables henceforth in the different methods you call from your client applications. Using them helps to improve the readability, as well as the management, of the code.

Adding the MapReduce Classes
Hadoop Streaming is an interface for writing MapReduce jobs in the language of your choice. The Hadoop SDK for .NET is a wrapper over Streaming that provides a convenient experience for .NET developers to develop MapReduce programs. The jobs can be submitted for execution via the API. The command is displayed on the JobTracker web interface and can be used for direct invocation if required. A .NET MapReduce program comprises a number of parts, which are described in Table 5-1:

	Job definition
	Mapper, Reducer, and Combiner classes
	Input data
	Job executor

Table 5-1. The function of .NET MapReduce components

Component                   Function
Job definition              This class has the declarations for the Mapper, Reducer, and Combiner types, as well as the job configuration settings.
Map, Reduce, and Combine    These are the actual classes you use to implement your processing logic.
Input data                  The data for the MapReduce job to process.
Job executor                The entry point of your program (for example, the Main method), which invokes the HadoopJobExecutor API.

In the following section, you will create a MapReduce program that calculates the square root of all the integer values provided as input and writes
which is explained in Chapter 3.

The Hive ODBC Driver
One of the main advantages of Hive is that it provides a querying experience that is similar to that of a relational database, which is a familiar experience for many business users. Additionally, the ODBC driver for Hive enables users to connect to HDInsight and execute HiveQL queries from familiar tools like Excel, SQL Server Integration Services (SSIS), Power View, and others. Essentially, the driver allows all ODBC-compliant clients to consume HDInsight data through familiar ODBC Data Source Names (DSNs), thus exposing HDInsight to a wide range of client applications.

Installing the Driver
The driver comes in two flavors: 64-bit and 32-bit. Be sure to install both the 32-bit and 64-bit versions of the driver; you'll need to install them separately. If you install only the 64-bit driver, you'll get errors in your 32-bit applications (for example, Visual Studio) when trying to configure your connections. The driver can be downloaded and installed from the following site:

http://www.microsoft.com/en-us/download/details.aspx?id=40886

Once the installation of the driver is complete, you can confirm the installation status by checking whether the Microsoft Hive ODBC Driver is present in the ODBC Data Source Administrator's list of drivers, as shown in Figure 8-4.
Figure 6-16. The running tasks in the TaskTracker (the task list shows reduce attempt attempt_201312100246_0027_r_000000_0 in the COMMIT_PENDING state at 33.33% progress)

While the JobTracker (or the MapReduce service tracker) is the master monitoring the overall execution of a MapReduce job, the TaskTrackers manage the execution of individual tasks on each slave node. Another important responsibility of the TaskTracker is to constantly communicate with the JobTracker. If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it will assume the TaskTracker has crashed and will resubmit the corresponding tasks to other nodes in the cluster.

HDInsight Windows Services
In traditional Hadoop, each process, like the namenode, datanode, and so on, is known as a daemon, which stands for Disk and Execution Monitor. In simple terms, a daemon is a long-running background process that answers requests for services. In the Windows environment, they are called services, and Windows provides a centralized way to view and manage the services running in the system through a console known as the Services console. Hadoop daemons are translated to Windows services in the HDInsight distribution. To view the Hadoop services running on your cluster head node, click on Start > Run and type in Services.msc. This will launch the Services console, and you will see the different Apache Hadoop-related services, as shown in Figure 6-17.
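If you prefer to stay at the command line, the same information is available through PowerShell. This one-liner is a generic sketch; the exact display names of the Hadoop services vary between HDInsight versions, so the wildcard pattern is an assumption you may need to adjust.

# List the Hadoop-related Windows services and their current states
Get-Service | Where-Object { $_.DisplayName -like "*hadoop*" } |
    Format-Table Name, DisplayName, Status -AutoSize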
As you can see, the function calls are logged as pairs of Enter and Exit blocks, along with the return codes. You can verify details like the connection-pooling information and the ODBC driver version from the trace. In the case of an error, you will see an error block with a diagnosis (DIAG) code for further analysis, as in the following snippet:

1c4-186c	EXIT  SQLDriverConnectW with return code -1 (SQL_ERROR)
		HDBC      0x008BA108
		HWND      0x000F0598
		WCHAR *   0x0F63B45C (-3)
		SWORD     -3
		WCHAR *   0x0F63B45C
		SWORD     -3
		SWORD *   0x00000000
		UWORD     0 <SQL_DRIVER_NOPROMPT>

		DIAG [08001] Unable to establish connection with hive server (1)

If the ODBC driver you use does not implement its own logging mechanism, this standard Windows ODBC trace is the only option for checking ODBC API calls and their return codes. You can also follow the step-by-step process in the article at http://support.microsoft.com/kb/274551.

Note: Make sure you turn off system-wide ODBC tracing once your data collection is over; otherwise, it can significantly hurt the performance of the entire system. Data collection carries with it overhead that you should tolerate only when actively troubleshooting a problem.

Logging Windows Azure Storage Blob Operations
You can configure your storage account to monitor and log operations that pass over the Windows Azure Storage Blob (WASB). These include operations that you initiate
76. 0000 successfully 2013 11 16 17 28 42 151 INFO org apache hadoop mapred JobTracker Adding task REDUCE attempt_201311120315 0003 r 000000 0 to tip task_201311120315 0003 r 000000 for tracker tracker_workernode2 127 0 0 1 127 0 0 1 49175 2013 11 16 17 28 51 473 INFO org apache hadoop mapred JobInProgress Task attempt_201311120315 0003 r 000000 0 has completed task _201311120315 0003 r 000000 successfully 2013 11 16 17 28 51 484 INFO org apache hadoop mapred JobTracker Adding task JOB_CLEANUP attempt_201311120315 0003 _m_000001_0 to tip task_201311120315 0003 m 000001 for tracker tracker_workernode2 127 0 0 1 127 0 0 1 49175 2013 11 16 17 28 53 734 INFO org apache hadoop mapred JobInProgress Task attempt_201311120315 0003 _m_000001_0 has completed task _201311120315 0003 m 000001 successfully 2013 11 16 17 28 53 735 INFO org apache hadoop mapred JobInProgress Job job_201311120315 0003 has completed successfully 231 CHAPTER 13 TROUBLESHOOTING JOB FAILURES 2013 11 16 17 28 53 736 INFO org apache hadoop mapred JobInProgress JobSummary jobId job_201311120315 0003 submitTime 1384622907254 launchTime 1384622909953 firstMapTaskLaunchTime 1384622917870 firstReduceTaskLaunchTime 1384622922122 firstJobSetupTaskLaunchTime 1384622909966 firstJobCleanupTaskLaunchTime 1384622931484 finishTime 1384622933735 numMaps 1 numSlotsPerMap 1 numReduces 1 numSlotsPerReduce 1 user amarpb queue default status SUCCEEDE
001.
	Hive Server Type is set to Hive Server 2.
	Authentication Mechanism is set to Windows Azure HDInsight Emulator.

If the problem persists even when all the preceding items are set correctly, try to test basic connectivity from Internet Explorer. Navigate to the following URLs, which target the same endpoints that ODBC uses:

Azure: https://<cluster>.azurehdinsight.net:443/hive/servlets/thrifths
Localhost: http://localhost:10001/servlets/thrifths

A successful test will show an HTTP 500 error, where the error page will look like this at the top:

HTTP ERROR 500
Problem accessing /servlets/thrifths. Reason:
	INTERNAL_SERVER_ERROR

This error occurs because the server expects a specific payload to be sent in a request, and Internet Explorer doesn't allow you to do that. However, the error does mean that the server is running and listening on the right port, and in that sense, this particular error is actually a success. For more help, you can turn on ODBC logging, as described in Chapter 11. With logging on, you can trace each of the ODBC Driver Manager calls to investigate whatever problem is occurring.

Summary
The entire concept of using the Azure HDInsight Service is based on the fact that it is an elastic service, that is, a service you can extend as and when required. Submitting jobs is the only time you really need to spin up a cluster, because your data is always with you, residing on
78. 04 42 INFO manager SqlManager Using default fetchSize of 1000 13 12 10 01 04 42 INFO tool CodeGenTool Beginning code generation 13 12 10 01 04 46 INFO manager SqlManager Executing SOL statement SELECT t FROM stock_analysis AS t WHERE 1 0 13 12 10 01 04 47 INFO orm CompilationManager HADOOP_MAPRED HOME is c apps dist hadoop 1 2 0 1 3 1 0 06 13 12 10 01 04 47 INFO orm CompilationManager Found hadoop core jar at c apps dist hadoop 1 2 0 1 3 1 0 06 hadoop core jar Note tmp sqoop hadoopuser compile 72c67877dd976aed8e4a36b3baa4519b stock_analysis java uses or overrides a deprecated API Note Recompile with Xlint deprecation for details 13 12 10 01 04 49 INFO orm CompilationManager Writing jar file tmp sqoop hadoopuser compile 72c67 877dd976aed8e4a36b3baa4519b stock_analysis jar 13 12 10 01 04 50 INFO mapreduce ImportJobBase Beginning import of stock_analysis 13 12 10 01 04 56 INFO mapred JobClient 13 12 10 01 04 57 INFO mapred JobClient 13 12 10 01 05 42 INFO mapred JobClient 13 12 10 01 05 45 INFO mapred JobClient 13 12 10 01 05 45 INFO mapred JobClient 13 12 10 01 05 45 INFO mapred JobClient 13 12 10 01 05 45 INFO mapred JobClient 13 12 10 01 05 45 INFO mapred JobClient slots ms 0 13 12 10 01 05 45 INFO mapred JobClient reserving slots ms 0 13 12 10 01 05 45 INFO mapred JobClient 13 12 10 01 05 45 INFO mapred JobClient 13 12 10 01 05 45 INFO mapred JobClient 13 12 10 01 05 45 INFO mapred JobClient
06\hadoop-core.jar;
c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\hadoop-examples-1.2.0.1.3.1.0-06.jar;
c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\hadoop-examples.jar;
c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\hadoop-minicluster-1.2.0.1.3.1.0-06.jar;
c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\hadoop-test-1.2.0.1.3.1.0-06.jar;
c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\hadoop-test.jar;
c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\hadoop-tools-1.2.0.1.3.1.0-06.jar;
c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\hadoop-tools.jar;
c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\lib;
c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\lib\jsp-2.1;
c:\apps\dist\log4jetwappender\microsoft-log4j-etwappender-1.0.jar
org.apache.hadoop.mapred.JobTracker DateTime=2013-11-24T06:35:12.0190000Z Timestamp=3610354520
HadoopServiceTraceSource Information: 0 : ServiceHost OnStart DateTime=2013-11-24T06:35:12.0346250Z Timestamp=3610410266
HadoopServiceTraceSource Information: 0 : Child process started, PID: 4976 DateTime=2013-11-24T06:35:12.0346250Z Timestamp=3610428330

Apart from the trace file, Hadoop has built-in logging mechanisms implementing the log4j framework. The following JobTracker log files are located in the C:\apps\dist\hadoop-1.2.0.1.3.1.0-06\logs folder:

	hadoop-jobtracker-<Hostname>.log
	hadoop-tasktracker-<Hostname>.log
	hadoop-historyserver-<Hostname>.log

These files record the actual execution status of the MapReduce jobs. Listing 13-6 shows an excerpt of
...-07-2013	31.47	31.6	31.4	31.54	28870700	31.54	NASDAQ
MSFT	26-07-2013	31.26	31.62	31.21	31.62	38633600	31.62	NASDAQ
MSFT	25-07-2013	31.62	31.65	31.25	31.39	63213000	31.39	NASDAQ
MSFT	24-07-2013	32.04	32.19	31.89	31.96	52803100	31.96	NASDAQ
MSFT	23-07-2013	31.91	32.04	31.71	31.82	65810400	31.82	NASDAQ
MSFT	22-07-2013	31.7	32.01	31.6	32.01	79040700	32.01	NASDAQ

It is very important to note that Hive queries use minimal caching, statistics, or optimizer tricks. They generally read the entire data set on each execution, and thus are more suitable for batch processing than for online work. One of the strongest recommendations I have for you while you are querying Hive is to write SELECT * instead of listing specific column names. Fetching a selective list of columns, as in Listing 8-9, is a best practice when the source is a classic database management system like a SQL Server database, but the story is completely different with Hive.

Listing 8-9. Selecting a partial list of columns

SELECT stock_symbol, stock_volume FROM stock_analysis;

The general principle of Hive is to expose Hadoop MapReduce functionality through an SQL-like language. Thus, when you issue a command like that in Listing 8-9, a MapReduce job will be triggered to remove any columns from the Hive table data set that aren't being specified in the query and to send back only the columns stock_symbol and stock_volume. On the other hand, the HiveQL in Listing 8-10 does not require any MapReduce job at all.
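Listing 8-10 itself is cut off in this excerpt; given the surrounding discussion, it is the full-table form of the same query, sketched below. Because a SELECT * with no filtering simply streams the stored files back to the client, Hive can serve it without launching a MapReduce job.

Listing 8-10. Selecting all columns

SELECT * FROM stock_analysis;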
2013-11-15 11:56:49,332 INFO ql.Driver (Driver.java:getSchema(259)) - Returning Hive schema: Schema(fieldSchemas:null, properties:null)
2013-11-15 11:56:49,332 INFO ql.Driver (PerfLogger.java:PerfLogEnd(127)) - </PERFLOG method=compile start=1384516609326 end=1384516609332 duration=6>
2013-11-15 11:56:49,332 INFO ql.Driver (PerfLogger.java:PerfLogBegin(100)) - <PERFLOG method=Driver.execute>
2013-11-15 11:56:49,333 INFO ql.Driver (Driver.java:execute(1066)) - Starting command: create database test
2013-11-15 11:56:49,333 INFO ql.Driver (PerfLogger.java:PerfLogEnd(127)) - </PERFLOG method=TimeToSubmit start=1384516609326 end=1384516609333 duration=7>
2013-11-15 11:56:49,871 INFO ql.Driver (PerfLogger.java:PerfLogEnd(127)) - </PERFLOG method=Driver.execute start=1384516609332 end=1384516609871 duration=539>
2013-11-15 11:56:49,872 INFO ql.Driver (SessionState.java:printInfo(423)) - OK
2013-11-15 11:56:49,872 INFO ql.Driver (PerfLogger.java:PerfLogBegin(100)) - <PERFLOG method=releaseLocks>
2013-11-15 11:56:49,872 INFO ql.Driver (PerfLogger.java:PerfLogEnd(127)) - </PERFLOG method=releaseLocks start=1384516609872 end=1384516609872 duration=0>
2013-11-15 11:56:49,872 INFO ql.Driver (PerfLogger.java:PerfLogEnd(127)) - </PERFLOG method=Driver.run start=1384516609326 end=1384516609872 duration=546>
2013-11-15 11:56:49,873 INFO CliDriver (SessionState.java:printInfo(423)) -
82. 1 24 06 35 14 362 INFO org apache hadoop http HttpServer Jetty bound to port 50030 2013 11 24 06 35 16 264 INFO org apache hadoop mapred JobTracker Setting safe mode to false Requested by hdp 2013 11 24 06 35 16 329 INFO org apache hadoop util NativeCodeLoader Loaded the native hadoop library 2013 11 24 06 35 16 387 INFO org apache hadoop mapred JobTracker Cleaning up the system directory 2013 11 24 06 35 17 172 INFO org apache hadoop mapred JobHistory Creating DONE folder at wasb democlustercontainer democluster blob core windows net mapred history done 2013 11 24 06 35 17 536 INFO org apache hadoop mapred JobTracker History server being initialized in embedded mode 2013 11 24 06 35 17 555 INFO org apache hadoop mapred JobHistoryServer Started job history server at 0 0 0 0 50030 Adding a new node fd0 ud0 workernodeo 2013 11 24 06 35 18 363 INFO org apache hadoop mapred JobTracker Adding tracker tracker_workernode0 127 0 0 1 127 0 0 1 49186 to host workernodeo 2013 11 24 06 35 19 083 INFO org apache hadoop net NetworkTopology Adding a new node fd1 ud1 workernode1 2013 11 24 06 35 19 094 INFO org apache hadoop mapred JobTracker Adding tracker tracker_workernode1 127 0 0 1 127 0 0 1 49193 to host workernode1 2013 11 24 06 35 19 365 INFO org apache hadoop mapred CapacityTaskScheduler Initializing joblauncher queue with cap 25 0 maxCap 25 0 ulMin 100 ulMinFactor 100 0 supportsPriorities false maxJobsToIni
13/12/09 22:33:33 INFO input.FileInputFormat: Total input paths to process : 1
13/12/09 22:33:33 WARN snappy.LoadSnappy: Snappy native library is available
13/12/09 22:33:33 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/12/09 22:33:33 INFO snappy.LoadSnappy: Snappy native library loaded
13/12/09 22:34:07 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/12/09 22:34:07 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
(The remaining mapred.JobClient counter entries on this page are garbled in extraction; one surviving value reads bytes=1029046272.)
13/12/10 01:37:42 INFO mapred.JobClient:  map 94% reduce 0%
13/12/10 01:37:43 INFO mapred.JobClient:  map 99% reduce 0%
13/12/10 01:37:45 INFO mapred.JobClient:  map 100% reduce 0%
13/12/10 01:37:50 INFO mapred.JobClient: Job complete: job_201311240635_0206
13/12/10 01:37:50 INFO mapred.JobClient: Counters: 20
13/12/10 01:37:50 INFO mapred.JobClient:   Job Counters
13/12/10 01:37:50 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=151262
13/12/10 01:37:50 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/12/10 01:37:50 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/12/10 01:37:50 INFO mapred.JobClient:     Rack-local map tasks=4
13/12/10 01:37:50 INFO mapred.JobClient:     Launched map tasks=4
13/12/10 01:37:50 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
13/12/10 01:37:50 INFO mapred.JobClient:   File Output Format Counters
13/12/10 01:37:50 INFO mapred.JobClient:     Bytes Written=0
13/12/10 01:37:50 INFO mapred.JobClient:   FileSystemCounters
13/12/10 01:37:50 INFO mapred.JobClient:     WASB_BYTES_READ=3027416
13/12/10 01:37:50 INFO mapred.JobClient:     FILE_BYTES_READ=3696
13/12/10 01:37:50 INFO mapred.JobClient:     HDFS_BYTES_READ=792
13/12/10 01:37:50 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=296608
13/12/10 01:37:50 INFO mapred.JobClient:   File Input Format Counters
13/12/10 01:37:50 INFO mapred.JobClient:     Bytes Read=0
2013-08-16 21:32:40,199 INFO org.apache.hadoop.hdfs.util.GSet: 2% max memory = 72.81875 MB
2013-08-16 21:32:40,199 INFO org.apache.hadoop.hdfs.util.GSet: capacity = 2^23 = 8388608 entries
2013-08-16 21:32:40,199 INFO org.apache.hadoop.hdfs.util.GSet: recommended=8388608, actual=8388608
2013-08-16 21:32:40,245 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hdp
2013-08-16 21:32:40,245 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2013-08-16 21:32:40,245 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=false
2013-08-16 21:32:40,261 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: dfs.block.invalidate.limit=100
2013-08-16 21:32:40,261 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
2013-08-16 21:32:40,292 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStateMBean and NameNodeMXBean
2013-08-16 21:32:40,355 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: dfs.namenode.edits.toleration.length = 0
2013-08-16 21:32:40,355 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names occuring more than 10 times
2013-08-16 21:32:40,386 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Read: length=4
2013-08-16 21:32:40,386 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Corruption: length=0
2013-08-16 21:32:40,386 INFO org.apache.h
Figure 11-3. Hadoop Log4j logs

A few of the supporting projects, like Hive, also support the Log4j framework. They keep these logs in their own log directory, similar to Hadoop. Following is a snippet of my Hive server log files running on democluster:

(HiveMetaStore.java:main(2940)) - Starting hive metastore on port 9083
2013-08-16 21:24:32,437 INFO metastore.HiveMetaStore (HiveMetaStore.java:newRawStore(349)) - 0: Opening raw store with implemenation class: org.apache.hadoop.hive.metastore.ObjectStore
2013-08-16 21:24:32,469 INFO mortbay.log (Slf4jLog.java:info(67)) - Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2013-08-16 21:24:32,515 INFO metastore.ObjectStore (ObjectStore.java:initialize(206)) - ObjectStore, initialize called
2013-08-16 21:24:32,578 INFO metastore.HiveMetaStore (HiveMetaStore.java:newRawStore(349)) - 0: Opening raw store with implemenation class: org.apache.hadoop.hive.metastore.ObjectStore
2013-08-16 21:24:32,625 INFO metastore.ObjectStore (ObjectStore.java:initialize(206)) - ObjectStore, initialize called
(HiveMetaStore.java:startMetaStore(3032)) - Starting DB backed MetaStore Server
2013-08-16 21:24:40,090 INFO metastore.HiveMetaStore (HiveMetaStore.java:startMetaStore(3044)) - Started the new metaserver on port 9083
2013-08-16 21:24:40,090 INFO metastore.Hive
(The JobTracker portal's job table lists the submitted jobs, for example streamjob6737947396646342963.jar and TempletonControllerJob, each showing 100.00% map and reduce completion.)

Figure 5-4. JobTracker portal

You can click on the job in the portal to further drill down into the details of the operation, as shown in Figure 5-5. For the job shown here, the detail page reports:

Job Setup: Successful
Status: Succeeded
Started at: Mon Sep 16 12:11:51 GMT 2013
Finished at: Mon Sep 16 12:12:37 GMT 2013
Finished in: 45sec
Job Cleanup: Successful
Job Scheduling information: 0 running map tasks using 0 map slots. 0 additional slots reserved. 0 running reduce tasks using 0 reduce slots.

It also breaks the job down by task kind (map and reduce, both 100.00% complete) and lists the job counters, including SLOTS_MILLIS_MAPS (40,297), Launched map tasks (4), FILE_BYTES_READ (1,064), HDFS_BYTES_READ (45), ASV_BYTES_WRITTEN (242), and FILE_BYTES_WRITTEN (28,669).
Combine input records=251357
SPLIT_RAW_BYTES=161
Reduce input records=32956
Reduce input groups=32956
13/12/09 22:34:07 INFO mapred.JobClient:     Combine output records=32956
13/12/09 22:34:07 INFO mapred.JobClient:     Physical memory (bytes) snapshot=493834240
13/12/09 22:34:07 INFO mapred.JobClient:     Reduce output records=32956
13/12/09 22:34:07 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1430384640
13/12/09 22:34:07 INFO mapred.JobClient:     Map output records=251357

Note: The jobs you execute from the .NET and PowerShell programs are broken down internally into similar commands and executed as command-line jobs.

Make sure that the output files are created in the commandlineoutput folder, as provided in the MapReduce command, by issuing another -ls command. This command lists the output file(s) created by the job, as in Listing 6-4.

Listing 6-4. Verifying the output

c:\apps\dist\hadoop-1.2.0.1.3.1.0-06>hdfs fs -ls example/data/commandlineoutput
Found 1 items
-rw-r--r--   1 hadoopuser supergroup     337623 2013-12-09 22:34 /example/data/commandlineoutput/part-r-00000

You can copy the output to the local file system and inspect the results (the occurrences for each word will be in c:\output\part-r-00000) using the command in Listing 6-5.

Listing 6-5. Copying the MapReduce output from HDFS to the local file system

hadoop dfs -copyToLocal example/data/commandlineoutput c:\output

You can use Windows
2013-11-16 17:28:43,363 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.093 sec
2013-11-16 17:28:44,376 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.093 sec
2013-11-16 17:28:45,388 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.093 sec
2013-11-16 17:28:46,395 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.093 sec
2013-11-16 17:28:47,401 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.093 sec
2013-11-16 17:28:48,409 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.093 sec
2013-11-16 17:28:49,416 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.093 sec
2013-11-16 17:28:50,423 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 3.093 sec
2013-11-16 17:28:51,429 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 3.093 sec
2013-11-16 17:28:52,445 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.514 sec
2013-11-16 17:28:53,453 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.514 sec
2013-11-16 17:28:54,462 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.514 sec
MapReduce Total cumulative CPU time: 5 seconds 514 msec
Ended Job = job_201311120315_0003
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 5.514 sec   HDFS Read: 245  HDFS Write: 6  SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 514 msec
OK
59793
Time taken: 48.899 seconds, Fetched: 1 row(s)

As we see from the preceding output, the job that was created is job_201311120315_0003. Now take a look at the folder C:\apps\dist\hadoop-1.2.0.1.3.0.1-0302\logs. In
ExitCode: 0
Name: Query
> sqoop export --connect "jdbc:sqlserver://<Server>.database.windows.net;username=debarchans@<Server>;password=<Password>;database=sqoopdemo" --table stock_analysis --export-dir /user/hadoopuser/example/data/StockAnalysis/input --fields-terminated-by ,
State: Completed
SubmissionTime: 12/10/2013 1:36:36 AM
Cluster: democluster
PercentComplete: map 100% reduce 0%
JobId: job_201311240635_0205

D:\python27\python.exe: can't open file 'bin\hcat.py': [Errno 2] No such file or directory
13/12/10 01:36:48 INFO manager.SqlManager: Using default fetchSize of 1000
13/12/10 01:36:48 INFO tool.CodeGenTool: Beginning code generation
13/12/10 01:36:52 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM stock_analysis AS t WHERE 1=0
13/12/10 01:36:53 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is c:\hdfs\mapred\local\taskTracker\admin\jobcache\job_201311240635_0205\attempt_201311240635_0205_m_000000_0\work\C:\apps\dist\hadoop-1.2.0.1.3.1.0-06
13/12/10 01:36:53 WARN orm.CompilationManager: HADOOP_MAPRED_HOME appears empty or missing
Note: \tmp\sqoop-hdp\compile\c2070a7782f921c6cd0cfd58ab7efe66\stock_analysis.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
13/12/10 01:36:54 INFO orm.CompilationManager: Writing jar file: \tmp\sqoop-hdp\compile\c2070a7782f921c6cd0cfd5
Project[chararray][0] - scope-54
LEVELS: New For Each(false)[bag] - scope-52
POUserFunc(org.apache.pig.builtin.REGEX_EXTRACT)[chararray] - scope-50
Cast[chararray] - scope-47
Project[bytearray][0] - scope-46
Constant(TRACE|DEBUG|INFO|WARN|ERROR|TOTAL) - scope-48
Constant(1) - scope-49
LOGS: Load(wasb://democlustercontainer@democluster.blob.core.windows.net/sample.log:org.apache.pig.builtin.PigStorage) - scope-45

MapReduce node scope-58
Map Plan
FILTEREDLEVELS: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-57
FILTEREDLEVELS: Filter[bag] - scope-53
Not[boolean] - scope-56
POIsNull[boolean] - scope-55
Project[chararray][0] - scope-54
LEVELS: New For Each(false)[bag] - scope-52
POUserFunc(org.apache.pig.builtin.REGEX_EXTRACT)[chararray] - scope-50
Cast[chararray] - scope-47
Project[bytearray][0] - scope-46
Constant(TRACE|DEBUG|INFO|WARN|ERROR|TOTAL) - scope-48
Constant(1) - scope-49
LOGS: Load(wasb://democlustercontainer@democluster.blob.core.windows.net/sample.log:org.apache.pig.builtin.PigStorage) - scope-45
Global sort: false

The EXPLAIN operator's output is segmented into three sections:

• Logical plan
2013-11-16 17:28:29,953 INFO org.apache.hadoop.mapred.JobInProgress: tip:task_201311120315_0003_m_000000 has split on node:/fd0/ud0/localhost
2013-11-16 17:28:29,953 INFO org.apache.hadoop.mapred.JobInProgress: job_201311120315_0003 LOCALITY_WAIT_FACTOR=0.25
2013-11-16 17:28:29,953 INFO org.apache.hadoop.mapred.JobInProgress: Job job_201311120315_0003 initialized successfully with 1 map tasks and 1 reduce tasks.
2013-11-16 17:28:29,966 INFO org.apache.hadoop.mapred.JobTracker: Adding task (JOB_SETUP) 'attempt_201311120315_0003_m_000002_0' to tip task_201311120315_0003_m_000002, for tracker 'tracker_workernode2:127.0.0.1/127.0.0.1:49175'
2013-11-16 17:28:37,865 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201311120315_0003_m_000002_0' has completed task_201311120315_0003_m_000002 successfully.
2013-11-16 17:28:37,869 INFO org.apache.hadoop.mapred.JobInProgress: Choosing a non-local task task_201311120315_0003_m_000000
2013-11-16 17:28:37,870 INFO org.apache.hadoop.mapred.JobTracker: Adding task (MAP) 'attempt_201311120315_0003_m_000000_0' to tip task_201311120315_0003_m_000000, for tracker 'tracker_workernode2:127.0.0.1/127.0.0.1:49175'
2013-11-16 17:28:39,710 INFO org.apache.hadoop.mapred.JobInitializationPoller: Removing scheduled jobs from waiting queue job_201311120315_0003
2013-11-16 17:28:42,118 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201311120315_0003_m_000000_0' has completed task_201311120315_0003_m_00
(The portal's Cluster Summary page reports build details such as "Compiled Tue Oct 22 13:40:30 Pacific Daylight Time 2013 by jenkins," links to Browse the filesystem and Namenode Logs, and summary figures: 12 files and directories, 1 block (13 total), Heap Size 119.06 MB / 3.56 GB, Configured Capacity 465.47 GB, DFS Used 0.42 KB, Non DFS Used 134.33 GB, DFS Remaining 331.14 GB, DFS Used% 0%, DFS Remaining% 71.14%, Dead Nodes 0, Decommissioning Nodes 0, and Number of Under-Replicated Blocks 0. There are no upgrades in progress.)

Figure 7-5. HDInsight Emulator Name Node portal

Note: If you get errors launching the portal, make sure that the Apache Hadoop services are running through the Windows Services console (Start > Run > Services.msc); a quick PowerShell check is sketched at the end of this section.

The deployment of core Hadoop and the supporting projects is done in the C:\Hadoop directory by the emulator. Note that this path is slightly different (the C:\apps\Dist directory) in the case of the actual Azure HDInsight service. As of this writing, the Emulator ships version 1.6 of the HDInsight service, which is HDP version 1.1. This is going to get updated periodically, as and when new versions of core Hadoop and HDP are tested and ported to the Windows platform. When you navigate to the C:\Hadoop directory, you should see a folder hierarchy similar to Figure 7-6.
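As the quick check promised above: a minimal PowerShell sketch for verifying the emulator's Hadoop services, assuming they are registered with display names beginning with "Apache Hadoop" (the display-name filter is an assumption and may vary by emulator version):

Get-Service -DisplayName "Apache Hadoop*" | Format-Table Name, Status, DisplayName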
8ab7efe66\stock_analysis.jar
13/12/10 01:36:54 INFO mapreduce.ExportJobBase: Beginning export of stock_analysis
13/12/10 01:36:58 INFO input.FileInputFormat: Total input paths to process : 1
13/12/10 01:36:58 INFO input.FileInputFormat: Total input paths to process : 1
13/12/10 01:36:58 WARN snappy.LoadSnappy: Snappy native library is available
13/12/10 01:36:58 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/12/10 01:36:58 INFO snappy.LoadSnappy: Snappy native library loaded
13/12/10 01:36:58 INFO mapred.JobClient: Running job: job_201311240635_0206
13/12/10 01:37:00 INFO mapred.JobClient:  map 0% reduce 0%
13/12/10 01:37:21 INFO mapred.JobClient:  map 10% reduce 0%
13/12/10 01:37:22 INFO mapred.JobClient:  map 16% reduce 0%
13/12/10 01:37:23 INFO mapred.JobClient:  map 21% reduce 0%
13/12/10 01:37:27 INFO mapred.JobClient:  map 27% reduce 0%
13/12/10 01:37:28 INFO mapred.JobClient:  map 32% reduce 0%
13/12/10 01:37:30 INFO mapred.JobClient:  map 41% reduce 0%
13/12/10 01:37:33 INFO mapred.JobClient:  map 46% reduce 0%
13/12/10 01:37:34 INFO mapred.JobClient:  map 55% reduce 0%
13/12/10 01:37:35 INFO mapred.JobClient:  map 63% reduce 0%
13/12/10 01:37:36 INFO mapred.JobClient:  map 71% reduce 0%
13/12/10 01:37:37 INFO mapred.JobClient:  map 77% reduce 0%
13/12/10 01:37:39 INFO mapred.JobClient:  map 82% reduce 0%
13/12/10 01:37:40 INFO mapred.JobClient:  map 85% reduce 0%
13/12/10 01:37:41 INFO mapred.JobClient:  map 88% reduce 0%
(The PowerPivot grid shows the imported IBM rows for July 2013, with open, high, low, close, volume, and adjusted-close columns; note the double-precision artifacts, such as 196.30000000000001, in the price columns.)

Figure 9-10. The PowerPivot data model

Change the data type
Auto-detected: C:\windows\Microsoft.Net\assembly\GAC_MSIL\PresentationFramework.Aero2\v4.0_4.0.0.0__31bf3856ad364e35\PresentationFramework.Aero2.dll
Auto-detected: C:\windows\Microsoft.Net\assembly\GAC_MSIL\PresentationFramework-SystemXml\v4.0_4.0.0.0__b77a5c561934e089\PresentationFramework-SystemXml.dll
Auto-detected: C:\windows\Microsoft.Net\assembly\GAC_MSIL\PresentationFramework-SystemCore\v4.0_4.0.0.0__b77a5c561934e089\PresentationFramework-SystemCore.dll
Auto-detected: C:\windows\Microsoft.Net\assembly\GAC_MSIL\PresentationFramework-SystemData\v4.0_4.0.0.0__b77a5c561934e089\PresentationFramework-SystemData.dll
Auto-detected: D:\HadoopClient\HadoopClient\bin\Release\microsoft.hadoop.client.dll
Auto-detected: D:\HadoopClient\HadoopClient\bin\Release\microsoft.hadoop.mapreduce.dll
Auto-detected: D:\HadoopClient\HadoopClient\bin\Release\microsoft.hadoop.webclient.dll
Auto-detected: D:\HadoopClient\HadoopClient\bin\Release\Newtonsoft.Json.dll
Auto-detected: D:\HadoopClient\HadoopClient\bin\Release\HadoopClient.dll
Auto-detected: C:\windows\Microsoft.Net\assembly\GAC_MSIL\PresentationFramework-SystemXmlLinq\v4.0_4.0.0.0__b77a5c561934e089\PresentationFramework-SystemXmlLinq.dll
Auto-detected: C:\windows\Microsoft.Net\assembly\GAC_MSIL\UIAutomationClient\v4.0_4.0.0.0__31bf3856ad364e35\UIAutomationClient.dll
Auto-detected: C:\windows\Microsoft.Net\assembly\GAC_MSIL\PresentationUI\v4.0_4.0.0.0__31bf3856ad364e35\PresentationUI.dll
Auto-detected: C:\windows\Microsoft.Net\assembly\GAC_MSIL\ReachFr
(The ISE editor pane shows the same script as Listing 5-14: building a PSCredential from a secure-string password, defining the word-count MapReduce job, submitting it with Start-AzureHDInsightJob, waiting with Wait-AzureHDInsightJob, fetching the standard error output with Get-AzureHDInsightJobOutput, downloading the blob content with Get-AzureStorageBlobContent, and finally filtering the output file with findstr.)

Figure 5-6. Windows PowerShell ISE

The entire script can be saved as a PowerShell script file (.ps1) for later execution. Listing 5-14 shows the complete script.

Listing 5-14. PowerShell job submission script

$subscription = "Your Subscription Name"
$cluster = "democluster"
$storageAccountName = "democluster"
$Container = "democlustercontainer"
$storageAccountKey = Get-AzureStorageKey $storageAccountName | %{ $_.Primary }
$storageContext = New-AzureStorageContext -StorageAccountName
CertificateCredential(Constants.subscriptionId, cert, Constants.clusterName);

Once the credentials are created, it is time to create a JobSubmissionClient object and call the MapReduce job based on the definition.

// Create a hadoop client to connect to HDInsight
var jobClient = JobSubmissionClientFactory.Connect(creds);

// Run the MapReduce job
JobCreationResults mrJobResults = jobClient.CreateMapReduceJob(mrJobDefinition);
Console.Write("Executing WordCount MapReduce Job");

// Wait for the job to complete
WaitForJobCompletion(mrJobResults, jobClient);

The final step after the job submission is to read and display the stream of output from the blob storage. The following piece of code does that:

Stream stream = new MemoryStream();
CloudStorageAccount storageAccount = CloudStorageAccount.Parse("DefaultEndpointsProtocol=https;AccountName=" + Constants.storageAccount + ";AccountKey=" + Constants.storageAccountKey);
CloudBlobClient blobClient = storageAccount.CreateCloudBlobClient();
CloudBlobContainer blobContainer = blobClient.GetContainerReference(Constants.container);
CloudBlockBlob blockBlob = blobContainer.GetBlockBlobReference("example/data/WordCountOutput/part-r-00000");
blockBlob.DownloadToStream(stream);
stream.Position = 0;
StreamReader reader = new StreamReader(stream);
Console.Write("Done..Word counts are:\n");
Console.WriteLine(reader.ReadToEnd());
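The WaitForJobCompletion helper used above is defined elsewhere in the sample project. A sketch of a typical implementation follows; this mirrors the common SDK sample pattern and is an assumption, not necessarily the book's exact code (it also needs using System.Threading;):

private static void WaitForJobCompletion(JobCreationResults jobResults, IJobSubmissionClient client)
{
    // Poll the cluster until the job reaches a terminal state
    JobDetails jobInProgress = client.GetJob(jobResults.JobId);
    while (jobInProgress.StatusCode != JobStatusCode.Completed && jobInProgress.StatusCode != JobStatusCode.Failed)
    {
        jobInProgress = client.GetJob(jobInProgress.JobId);
        Thread.Sleep(TimeSpan.FromSeconds(10));
    }
}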
mapSlotSeconds=8 reduceSlotsSeconds=9 clusterMapCapacity=16 clusterReduceCapacity=8 jobName=select count(*) from hivesampletable(Stage-1)
2013-11-16 17:28:53,790 INFO org.apache.hadoop.mapred.JobQueuesManager: Job job_201311120315_0003 submitted to queue default has completed
2013-11-16 17:28:53,791 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201311120315_0003_m_000000_0'
2013-11-16 17:28:53,791 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201311120315_0003_m_000001_0'
2013-11-16 17:28:53,791 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201311120315_0003_m_000002_0'
2013-11-16 17:28:53,792 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201311120315_0003_r_000000_0'
2013-11-16 17:28:53,815 INFO org.apache.hadoop.mapred.JobHistory: Creating DONE subfolder at wasb://democlustercontainer@democluster.blob.core.windows.net/mapred/history/done/version-1/jobtrackerhost_1384226104721_/2013/11/16/000000
2013-11-16 17:28:53,978 INFO org.apache.hadoop.mapred.JobHistory: Moving file c:\apps\dist\hadoop-1.2.0.1.3.0.1-0302\logs\history\job_201311120315_0003_1384622907254_desarkar_select+count%28*%29+from+hivesampletable%28Stage-1%29_default_ to wasb://testhdi@democluster.blob.core.windows.net/mapred/history/done/version-1/jobtrackerhost_1384226104721_/2013/11/16/000000
2013-11-16 17:28:54,322 INFO org.apache.hadoop.mapred.JobHistory: Moving file c:\app
HDInsight servers:

data:    Name         Location      State
data:    SDPHDI1      East US       Running
data:    democluster  North Europe  Running
data:    datadork     West US       Running
data:    tutorial     West US       Running
info:    hdinsight cluster list command OK

You can use the azure hdinsight cluster delete <ClusterName> command to delete any existing cluster. To create a new cluster using the CLI, you need to provide the cluster name, subscription information, and other details, similar to provisioning a cluster using PowerShell or the .NET SDK. Listing 4-6 shows a sample command to create a new HDInsight cluster using the CLI.

Listing 4-6. Creating a Cluster Using CLI

azure hdinsight cluster create --clusterName <ClusterName> --storageAccountName <StorageAccountName> --storageAccountKey <storageAccountKey> --storageContainer <StorageContainer> --nodes <NumberOfNodes> --location <DataCenterLocation> --username <HDInsightClusterUsername> --clusterPassword <HDInsightClusterPassword>

Typically, you provision an HDInsight cluster, run jobs on it, and then delete the cluster to cut down the cost. The command-line interface also gives you the option to save the configurations into a file so that you can reuse them every time you provision a cluster. This is basically another way of automating cluster provisioning and several other administrative tasks. For comprehensive reference documentation
Debugger: Beginning of Application: Main, Program.cs line 25
Exception: Thrown: "Sequence contains no matching element" (System.InvalidOperationException)
Debugger: Stopped at Exception: First
Debugger: Exception Intercepted: CreateCluster, Program.cs line 51
Debugger: Step Recorded: CreateCluster, Program.cs line 56
Debugger: Step Recorded: CreateCluster, Program.cs line 57
Debugger: Step Recorded: CreateCluster, Program.cs line 58
Debugger: Step Recorded: CreateCluster, Program.cs line 59
Debugger: Step Recorded: CreateCluster, Program.cs line 60
Debugger: Step Recorded: CreateCluster, Program.cs line 61
Debugger: Step Recorded: CreateCluster, Program.cs line 62
Debugger: Step Recorded: CreateCluster, Program.cs line 63
Debugger: Step Recorded: CreateCluster, Program.cs line 64
Exception: Thrown: "Object reference not set to an instance of an object" (System.NullReferenceException)
Debugger: Stopped at Exception: CreateCluster, Program.cs line 64
Live Event: Exception Intercepted: CreateCluster, Program.cs line 64
An exception was intercepted and the call stack unwound to the point before the call from user code where the exception occurred. "Unwind the call stack on unhandled exceptions" is selected in the debugger options.
Thread: Main Thread (14316)
Related views: Calls View, Locals, Call Stack

Figure 12-5. IntelliTrace events window

If you opt to trace function call sequences while enabling IntelliTrace
DESCRIBE command to examine the structure of relation (or variable) B. Relation B has two fields. The first field is named group and is of type tuple. The second field is named a, after relation A, and is of type bag.

Note: A variable is also called a relation in Pig Latin terms.

Sqoop Jobs

Sqoop is the bidirectional data-transfer tool between HDFS (again, WASB in the Azure HDInsight service) and relational databases. In an HDInsight context, Sqoop is primarily used to import and export data to and from SQL Azure databases and the cluster storage. When you run a Sqoop command, Sqoop in turn runs a MapReduce task in the Hadoop cluster (map only, with no reduce task). There is no separate log file specific to Sqoop, so you need to troubleshoot a Sqoop failure or performance issue pretty much the same way as a MapReduce failure or performance issue.

Windows Azure Storage Blob

The underlying storage infrastructure for Azure is known as Windows Azure Blob Storage (WABS). Microsoft has implemented a thin wrapper that exposes this blob storage as the HDFS file system for HDInsight. This is referred to as Windows Azure Storage Blob (WASB) and is a notable change in Microsoft's Hadoop implementation on Windows Azure. As you saw throughout the book, Windows Azure Storage Blob (WASB) replaces HDFS and is the storage for your HDInsight clusters by default. It is important to understand the WASB issues you may encounter.
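To make the addressing concrete, a WASB path has the following general shape (the container, account, and path segments below are placeholders):

wasb://<container>@<account>.blob.core.windows.net/<path/to/file>

For example, the Pig plan quoted earlier resolves sample.log to wasb://democlustercontainer@democluster.blob.core.windows.net/sample.log.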
For example, Microsoft can analyze Facebook posts or Twitter sentiments to determine how Windows 8.1, its latest operating system, has been accepted in the industry and the community. Big Data solutions can parse huge unstructured data sources, such as posts, feeds, tweets, logs, and so forth, and generate intelligent analytics so that businesses can make better decisions and correct predictions. Figure 1-2 summarizes the thought process.

Figure 1-2. A process for determining whether you need Big Data (the figure poses questions such as "How do I better predict future outcomes?", "How do I optimize my fleet based on weather and traffic patterns?", and "What's the social sentiment for my brands and products?" across social and web analytics, live data feeds, and advanced analytics)

The next step in evaluating an implementation of any business process is to know your existing infrastructure and capabilities well. Traditional RDBMS solutions are still able to handle most of your requirements. For example, Microsoft SQL Server can handle 10s of TBs, whereas Parallel Data Warehouse (PDW) solutions can scale up to 100s of TBs of data. If you have highly relational data stored in a structured way, you likely don't need Big Data. However, both SQL Server and PDW appliances are not good at analyzing streaming text or dealing with large numbers of attributes or JSON. Also, typical Big Data solutions use a scale
Table 11-1. Log files available in HDInsight on Azure

Log File Name                Location                                     Service                                             Node
namenode-trace.log           C:\apps\dist\hadoop-1.2.0.1.3.1.0-06\bin     Hadoop Name Node Service                            Cluster name node
datanode-trace.log           C:\apps\dist\hadoop-1.2.0.1.3.1.0-06\bin     Hadoop Data Node Service                            Any of the cluster data nodes
secondarynamenode-trace.log  C:\apps\dist\hadoop-1.2.0.1.3.1.0-06\bin     Hadoop Secondary Name Node Service                  Cluster secondary name node
tasktracker-trace.log        C:\apps\dist\hadoop-1.2.0.1.3.1.0-06\bin     Hadoop taskTracker Service                          Any of the cluster data nodes
hiveserver-trace.log         C:\apps\dist\hive-0.11.0.1.3.1.0-06\bin      Hive Thrift Service                                 Cluster node running Hive
hiveserver2-trace.log        C:\apps\dist\hive-0.11.0.1.3.1.0-06\bin      Hive Server 2 (with concurrent connection support)  Cluster node running Hive
metastore-trace.log          C:\apps\dist\hive-0.11.0.1.3.1.0-06\bin      Hive Meta Store Service (Hive storage)              Cluster node running Hive
derbyserver-trace.log        C:\apps\dist\hive-0.11.0.1.3.1.0-06\bin      Hive Derby Server Service (Hive native storage)     Cluster node running Hive
oozieservice-out.log         C:\apps\dist\oozie-3.3.2.1.3.1.0-06          Oozie Service                                       Cluster node running Oozie
templeton-trace.log          C:\apps\dist\hcatalog-0.11.0.1.3.1.0-06\bin  Templeton Service                                   Cluster node running Templeton

Figure 11-1 will help you correlate the services to the startup logs that are listed in Table 11-1. The
Hadoop services running on the name node, you can do that from the Configuration tab, as shown in Figure 3-18.

Figure 3-18. Configuring Hadoop services

Hadoop services are turned on by default. You can click the OFF button to stop the services on the name node. You can also enable Remote Desktop access to your name node from the Configuration screen. Do that through the ENABLE REMOTE button at the bottom of this screen, as shown in Figure 3-19.

Figure 3-19. Enable Remote Desktop

Once you click on ENABLE REMOTE, you get an option to configure a remote user. Specify the password and a date when the remote access permission expires (for example, a user name such as debarchan with an EXPIRES ON date of 2013-11-22). The expiration is for security reasons: it forces you to periodically visit this configuration screen and extend the remote access privilege, so that it doesn't remain past when it is needed. Figure 3-20 shows the remote user configuration screen.

Figure 3-20. Configure Remote Desktop

Once Remote Desktop is configured for the cluster, you should see status messages similar to those
However, it is a good practice to have the skeleton class for the Reducer, which derives from the ReducerCombinerBase .NET Framework class, as shown in Listing 5-3. You can write your code in the overridden Reduce method later if you need to implement any reduce operations.

Listing 5-3. SquareRootReducer.cs

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.Hadoop.MapReduce;

namespace HadoopClient
{
    class SquareRootReducer : ReducerCombinerBase
    {
        public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
        {
            throw new NotImplementedException();
        }
    }
}

Note: The Windows Azure MSDN documentation has a sample C# wordcount program that implements both the Mapper and Reducer classes: http://www.windowsazure.com/en-us/documentation/articles/hdinsight-sample-csharp-streaming

Once the Mapper and Reducer classes are defined, you need to implement the HadoopJob class. This consists of the configuration information for your job, for example, the input data and the output folder path. Listing 5-4 shows the code snippet for the SquareRootJob class implementation.

Listing 5-4. SquareRootJob.cs

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.Hadoop.MapReduce;

namespace HadoopClient
{
    class SquareRootJob : Hadoop
InsightClient.Connect(creds);
var clusters = client.ListClusters();
foreach (var item in clusters)
{
    Console.WriteLine("Cluster: {0}, Nodes: {1}", item.Name, item.ClusterSizeInNodes);
}

Following are the first two lines of code. They connect to the X509 certificate store in read-only mode:

var store = new X509Store();
store.Open(OpenFlags.ReadOnly);

Next is a statement to load the Azure certificate based on the thumbprint:

var cert = store.Certificates.Cast<X509Certificate2>().First(item => item.Thumbprint == Constants.thumbprint);

After loading the certificate, our next step is to create a client object based on the credentials obtained from the subscription ID and the certificate. We do that using the following statements:

var creds = new HDInsightCertificateCredential(Constants.subscriptionId, cert);
var client = HDInsightClient.Connect(creds);

Then we enumerate the HDInsight clusters under the subscription. The following lines grab the cluster collection and loop through each item in the collection:

var clusters = client.ListClusters();
foreach (var item in clusters)
{
    Console.WriteLine("Cluster: {0}, Nodes: {1}", item.Name, item.ClusterSizeInNodes);
}

The WriteLine call within the loop prints the name of each cluster and its respective nodes. You can run this code to list out your existing clusters in a console window. You need to add a call to this ListClusters function in your Main method and run the application.
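A minimal Main for trying this out might look like the following sketch (the book's full Main, shown later in Listing 5-7, wires up the other operations as well):

static void Main(string[] args)
{
    // List the HDInsight clusters under the subscription
    ListClusters();
    // Keep the console window open until a key is pressed
    Console.ReadKey();
}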
(Screenshot residue: a blob listing of the democlustercontainer container in the democluster storage account. The blob URLs take the form https://democluster.blob.core.windows.net/democlustercontainer/..., the entries include the example/data/gutenberg sample texts such as davinci.txt and outlineofscience.txt, and the last-modified timestamps fall on Sun, 24 Nov 2013.)
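You can produce a similar listing from PowerShell. A minimal sketch, assuming the storage account name, container name, and account key variable used in the earlier scripts:

$storageContext = New-AzureStorageContext -StorageAccountName "democluster" -StorageAccountKey $storageAccountKey
Get-AzureStorageBlob -Container "democlustercontainer" -Context $storageContext | Select-Object Name, LastModified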
Job<SquareRootMapper>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        var config = new HadoopJobConfiguration()
        {
            InputPath = Constants.wasbPath + "example/data/Numbers.txt",
            OutputFolder = Constants.wasbPath + "example/data/SquareRootOutput"
        };
        return config;
    }
}

Note: I chose example/data as the input path where I would have my source file, Numbers.txt. The output will be generated in the example/data/SquareRootOutput folder. This output folder will be overwritten each time the job runs. If you want to preserve an existing job output folder, make sure to change the output folder name each time before job execution.

Per the configuration option specified in the job class, you need to upload the input file Numbers.txt, and the job will write the output data to a folder called SquareRootOutput in Windows Azure Storage Blob (WASB). This will be in the example/data directory of the democlustercontainer in the democluster storage account, as specified by the constant wasbPath in the Constants.cs class.
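The Constants class referenced here is not reproduced on this page. A minimal sketch of what it presumably contains, with placeholder values; all of the field names are inferred from how they are used in the surrounding samples, and the exact value shapes (for example, whether storageAccount is the bare account name or the full blob endpoint) depend on the call sites:

namespace HadoopClient
{
    internal static class Constants
    {
        // Azure subscription and management certificate (placeholders)
        public const string subscriptionId = "<your-subscription-id>";
        public const string thumbprint = "<your-certificate-thumbprint>";

        // Cluster coordinates (placeholders)
        public const string clusterName = "democluster";
        public static readonly Uri azureClusterUri = new Uri("https://democluster.azurehdinsight.net");
        public const string clusterUser = "admin";
        public const string hadoopUser = "hadoopuser";
        public const string clusterPassword = "<your-cluster-password>";

        // Default storage account and container (placeholders)
        public const string storageAccount = "democluster";
        public const string storageAccountKey = "<your-storage-account-key>";
        public const string container = "democlustercontainer";

        // Root WASB path used for job input/output
        public const string wasbPath = "wasb://democlustercontainer@democluster.blob.core.windows.net/";
    }
}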
110. MP RESULT On successful execution of Pig statements you should see output where the log entries are grouped by their values and arranged based on their number of occurrences Such output is shown in Listing 6 13 Listing 6 13 The Pig job output Input s Successfully read 1387 records 404 bytes from wasb example data sample log Output s Successfully stored 6 records in wasb democlustercontainer democluster blob core windows net tmp temp167788958 tmp 1711466614 102 CHAPTER 6 EXPLORING THE HDINSIGHT NAME NODE Counters Total records written 6 Total bytes written 0 Spillable Memory Manager spill count 0 Total bags proactively spilled 0 Total records proactively spilled 0 Job DAG job_201311240635 0221 gt job_201311240635_0222 job_201311240635_0222 gt job_201311240635_0223 job_201311240635_0223 2013 12 10 02 24 01 797 main INFO org apache pig backend hadoop executionengine mapReduceLayer MapReduceLauncher Success 2013 12 10 02 24 01 800 main INFO org apache pig data SchemaTupleBackend Key pig schematuple was not set will not generate code 2013 12 10 02 24 01 825 main INFO org apache hadoop mapreduce lib input FileInputFormat Total input paths to process 1 2013 12 10 02 24 01 825 main INFO org apache pig backend hadoop executionengine util MapRedUtil Total input paths to process 1 TRACE 816 DEBUG 434 INFO 96 WARN 11 ERROR 6 FATAL 2 M
111. MetaStore HiveMetaStore java startMetaStore 3046 193 CHAPTER 11 LOGGING IN HDINSIGHT Options minWorkerThreads 200 2013 08 16 21 24 40 090 INFO metastore HiveMetaStore HiveMetaStore java startMetaStore 3048 Options maxWorkerThreads 100000 2013 08 16 21 24 40 091 INFO metastore HiveMetaStore HiveMetaStore java startMetaStore 3050 TCP keepalive true 2013 08 16 21 24 40 104 INFO metastore HiveMetaStore HiveMetaStore java logInfo 392 1 get_databases default 2013 08 16 21 24 40 123 INFO metastore HiveMetaStore Logging initialized using configuration in file C apps dist hive 0 9 0 conf hive log4j properties 2013 08 16 21 25 03 078 INFO ql Driver PerfLogger java PerfLogBegin 99 lt PERFLOG method Driver run gt 2013 08 16 21 25 03 078 INFO ql Driver PerfLogger java PerflogBegin 99 lt PERFLOG method compile gt 2013 08 16 21 25 03 145 INFO parse ParseDriver ParseDriver java parse 427 Parsing command DROP TABLE IF EXISTS HiveSampleTable 2013 08 16 21 25 03 445 INFO parse ParseDriver ParseDriver java parse 444 Parse Completed 2013 08 16 21 25 03 541 INFO hive metastore HiveMetaStoreClient java open 195 Trying to connect to metastore with URI thrift headnodehost 9083 2013 08 16 21 25 03 582 INFO hive metastore HiveMetaStoreClient java open 209 Connected to metastore 2013 08 16 21 25 03 604 INFO metastore HiveMetaStore HiveMetaStore java logInfo 392
112. NPKG UNZIP source C HadoopInstallFiles HadoopPackages hdp 1 0 1 winpkg resources hadoop 1 1 0 SNAPSHOT winpkg zip WINPKG UNZIP destination C HadoopInstallFiles HadoopPackages hdp 1 0 1 winpkg resources WINPKG UNZIP unzipRoot C HadoopInstallFiles HadoopPackages hdp 1 0 1 winpkg resources hadoop 1 1 0 SNAPSHOT winpkg WINPKG Unzip of C HadoopInstallFiles HadoopPackages hdp 1 0 1 winpkg resources hadoop 1 1 0 SNAPSHOT winpkg zip to C HadoopInstallFiles HadoopPackages hdp 1 0 1 winpkg resources succeeded WINPKG UnzipRoot C HadoopInstallFiles HadoopPackages hdp 1 0 1 winpkg resources hadoop 1 1 0 SNAPSHOT winpkg WINPKG C HadoopInstallFiles HadoopPackages hdp 1 0 1 winpkg resources hadoop 1 1 0 SNAPSHOT winpkg scripts install ps1 credentialFilePath c hadoop singlenodecreds xml HADOOP Logging to existing log C HadoopInstallFiles HadoopSetupTools hdp 1 0 1 winpkg install log HADOOP Logging to C HadoopInstallFiles HadoopSetupTools hdp 1 0 1 winpkg install log HADOOP HDP_INSTALL_PATH C HadoopInstallFiles HadoopPackages hdp 1 0 1 winpkg resources hadoop 1 1 0 SNAPSHOT winpkg scripts HADOOP HDP_RESOURCES DIR C HadoopInstallFiles HadoopPackages hdp 1 0 1 winpkg resources hadoop 1 1 0 SNAPSHOT winpkg resources HADOOP nodeInstallRoot c hadoop HADOOP hadoopInstallToBin c hadoop hadoop 1 1 0 SNAPSHOT bin HADOOP Reading credentials from c hadoop singlenodecreds xml HADOOP Username PUMBAA hadoop HADOOP C
113. Name Node and check progress from JobTracker portal with the returned JobID IHadoop hadoop Hadoop Connect Constants azureClusterUri Constants clusterUser Constants hadoopUser Constants clusterPassword Constants storageAccount Constants storageAccountKey Constants container true var output hadoop MapReduceJob ExecuteJob lt SquareRootJob gt Finally add a call to the DoCustomMapReduce method from your Main function The Main function in your Program cs file should now look like Listing 5 7 Listing 5 7 Main method static void Main string args d ListClusters CreateCluster DeleteCluster DoCustomMapReduce Console ReadKey 66 CHAPTER 5 SUBMITTING JOBS TO YOUR HDINSIGHT CLUSTER Execute the HadoopClient project and your console output should display messages similar to the following Starting MapReduce job Log in remotely to your Name Node and check progress from JobTracker portal with the returned JobID File dependencies to include with job Auto detected D HadoopClient HadoopClient bin Debug HadoopClient vshost exe Auto detected D HadoopClient HadoopClient bin Debug HadoopClient exe Auto detected D HadoopClient HadoopClient bin Debug Microsoft Hadoop MapReduce d11 Auto detected D HadoopClient HadoopClient bin Debug Microsoft Hadoop WebClient d11 Auto detected D HadoopClient HadoopClient bin Debug Newtonsoft Json dll D D D D D Auto detected D Ha
114. State Completed SubmissionTime 11 24 2013 7 08 25 AM Cluster democluster PercentComplete JobId job_201311240635_0002 Logging initialized using configuration in file C apps dist hive 0 11 0 1 3 1 0 06 conf hive log4j properties OK Time taken 22 438 seconds You can verify the structure of the schema you just created using the script in Listing 8 5 Listing 8 5 Verifying the Hive schema subscriptionName YourSubscriptionName clustername democluster Select AzureSubscription SubscriptionName subscriptionName Use AzureHDInsightCluster clusterName Subscription Get AzureSubscription Current SubscriptionId querystring DESCRIBE stock_analysis Invoke Hive Query querystring This should display the structure of the stock_analysis table as shown here Successfully connected to cluster democluster Submitting Hive query Started Hive query with jobDetails Id job 201311240635 0004 Hive query completed Successfully stock_symbol string None stock_date string None stock price open double None stock price high double None stock price low double None stock price close double None stock_volume int None stock price adi close double None exchange string None Partition Information col_name data type comment exchange string None Now that you have the Hive schema ready you can start loading the stock data in your stock_analysis table 133 CHAPTER 8 ACCESSING HDINSIGHT OVER HIVE AND ODBC
115. THE EXPERT S VOICE IN BIG DATA ee Pro Microsoft HDinsight Hadoop on Windows YOUR COMPLETE GUIDE TO DEPLOYING AND USING APACHE HADOOP ON THE MICROSOFT WINDOWS AND WINDOWS AZURE PLATFORMS Debarchan Sarkar TTT Apress www allitebooks com For your convenience Apress has placed some of the front matter material after the index Please use the Bookmarks and Contents at a Glance links to access them mea Apress www allitebooks com Contents at a Glance About E E xiii About the Technical Reviewers sssssssssunsnnnnunnnnnnunnnnnnunnnnnnnnnnnnnannnnnnnnnnnnnannnnnnnnnnnnnannnnnnnnnann a XV Acknowledgments ssssssnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nannan xvii Introduction E xix Chapter 1 Introducing HDINSIght ssssssssnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn 1 Chapter 2 Understanding Windows Azure HDInsight ServiCe scccsssssssssssssseeees 13 Chapter 3 Provisioning Your HDinsight Service Cluster ssssssssnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn 23 Chapter 4 Automating HDinsight Cluster Provisioning sssssssssnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn 39 Chapter 5 Submitting Jobs to Your HDInsight Cluster ssssssssssnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn 59 Chapter 6 Exploring the HDinsight Name Node ssssssnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn 89 Chapter 7 Using Windows Azure HDIns
116. TION SERVICES Am HiveConsumer Property Pages _ 2 mE Som Configuration Active Development x rry N A Configuration Manager gt Common Properties 4 Data Flow Optimizations 4 Configuration Properties RunInOptimizedMode False Build 4 Debug Options Deployment InteractiveMode True Debugging Run64BitRuntime False 4 Start Action StartAction ExecutePackage StartApplication StartObjectID lt Active Package gt 4 Start Options CmdLineArguments Run64BitRuntime Specifies whether the project should start 64 bit SSIS runtime If 64 bit SSIS runtime is not installed this setting is ignored B A Figure 10 21 Running in 32 bit mode You can now schedule this package as a SQL Server job and run the data load on a periodic basis You also might want to apply some transformation to the data before it loads into the target SQL warehouse to clean it or to apply necessary business logic using the inbuilt SSIS Data Flow Transformation components There are other programmatic ways through which you can initiate a Hadoop job from SSIS For example you can develop your own custom SSIS components using NET and use them to automate Hadoop jobs A detailed description on this approach can be found on the following MSDN whitepaper http msdn microsoft com en us library jj720569 aspx Summary In this chapter you had a brief look into SQL Server and its Business Intelligence components You also developed a sample packa
117. TROUBLESHOOTING JOB FAILURES Time taken 0 548 seconds 2013 11 15 11 56 49 874 INFO ql Driver PerfLogger java PerfLogBegin 100 lt PERFLOG method releaseLocks gt 2013 11 15 11 56 49 874 INFO ql Driver PerfLogger java PerfLogEnd 127 lt PERFLOG method releaseLocks start 1384516609874 end 1384516609874 duration 0 gt The highlighted entries in Listing 13 8 are the regions you should be looking at if you wish to see the chain of events while executing your CREATE DATABASE Hive job Other entries are helpful in the event of an error Say for example you try to create a database that already exists The attempt would fail You would then look for entries in the log file such as those highlighted in Listing 13 9 Listing 13 9 hive log file showing HQL errors 2013 11 15 13 37 11 432 INFO ql Driver PerfLogger java PerflogBegin 100 lt PERFLOG method Driver run gt 2013 11 15 13 37 11 433 INFO ql Driver PerfLogger java PerflogBegin 100 lt PERFLOG method TimeToSubmit gt 2013 11 15 13 37 11 433 INFO ql Driver PerfLogger java PerfLogBegin 100 lt PERFLOG method compile gt 2013 11 15 13 37 11 434 INFO parse ParseDriver ParseDriver java parse 179 Parsing command create database test 2013 11 15 13 37 11 434 INFO parse ParseDriver ParseDriver java parse 197 Parse Completed 2013 11 15 13 37 11 435 INFO ql Driver Driver java compile 442 Semantic Analysis Completed 2013 11 15 13 37 11 436
118. UserProfile gt Downloads lt SubscriptionName gt credentials publishsettings 43 CHAPTER A AUTOMATING HDINSIGHT CLUSTER PROVISIONING You should see a message in the PowerShell prompt about setting your default subscription The message will be similar to the following VERBOSE Setting lt subscription_name gt as the default and current subscription To view other subscriptions use Get AzureSubscription Next execute the Get AzureSubscription command to list your subscription details as shown next Note the thumbprint that is generated you will be using this thumbprint further throughout your NET solution PS C gt Get AzureSubscription SubscriptionName lt subscription_name gt SubscriptionId lt subscription_Id gt ServiceEndpoint https management core windows net ActiveDirectoryEndpoint ActiveDirectoryTenantId IsDefault True Certificate Subject CN Windows Azure Tools Issuer CN Windows Azure Tools Serial Number 793EE9285FF3D4A84F4F6B73994F 3696 Not Before 12 4 2013 11 45 00 PM Not After 12 4 2014 11 45 00 PM Thumbprint lt Thumbprint gt CurrentStorageAccountName CurrentCloudStorageAccount ActiveDirectoryUserId Once this is done you are ready to code your Visual Studio application Note The publishsettings file contains sensitive information about your subscription and credentials Care should be taken to prevent unauthorized access to this file It is highly recommended that y
119. VICES F Enable Geo Replication CREATE STORAGE ACCOUNT vV Figure 3 3 Storage account details Ifyou wish Windows Azure can geo replicate your Windows Azure Blob and Table data at no additional cost between two locations hundreds of miles apart within the same region for example between North and South US between North and West Europe and between East and Southeast Asia Geo replication is provided for additional data durability in case of a major data center disaster Select the Enable Geo Replication check box if you want that functionality enabled Then click on CREATE STORAGE ACCOUNT to complete the process of adding a storage account Within a minute or two you should see the storage account created and ready for use in the portal as shown in Figure 3 4 25 CHAPTER 3 PROVISIONING YOUR HDINSIGHT SERVICE CLUSTER stora ge NAME STATUS datadork wf Online hadooponcioud VW Online hdidemo wf Online hdinsightstorage Online Figure 3 4 The democluster storage account Note Enabling geo replication later for a storage account that has data in it might have a pricing impact on the subscription Creating a SQL Azure Database When you actually provision your HDInsight cluster you also get the option of customizing your Hive and Oozie data stores In contrast to the traditional Apache Hadoop HDInsight gives you the option of selecting a SQL Azure database for storing the metadata for Hive and Oozie This section quic
120. You can also execute MapReduce jobs using the command line Listing 7 4 shows you a sample job you can trigger from the Hadoop command prompt Listing 7 4 Using the Hadoop command line hadoop jar hadoop examples jar wordcount example data WordCountOutputEmulator example data gutenberg davinci txt Note You need to have the hadoop examples jar file at the root of your Blob container to execute the job successfully As with the Azure service the recommended way to submit and execute MapReduce jobs is through the NET SDK or the PowerShell cmdlets You can refer to Chapter 5 for such job submission and execution samples there are very minor changes like the cluster name which is your local machine when you are using the emulator Listing 7 5 shows a sample PowerShell script you can use for your MapReduce job submissions Listing 7 5 MapReduce PowerShell script creds Get Credential cluster http localhost 50111 inputPath wasb democlustercontainer democluster blob core windows net example data gutenberg davinci txt outputFolder wasb democlustercontainer democluster blob core windows net example data WordCountOutputEmulatorPs jar wasb democlustercontainer democluster blob core windows net hadoop examples jar className wordcount hdinsightJob New AzureHDInsightMapReduceJobDefinition JarFile jar ClassName className Arguments inputPath outputPath Submit the MapReduce job
121. abase in SQL Azure You will later use this database as metadata storage for Hive and Oozie when you provision your HDInsight cluster Deploying Your HDinsight Cluster Now that you have your dedicated storage account ready select the HDINSIGHT option in the portal and click on CREATE AN HDINSIGHT CLUSTER as shown in Figure 3 7 hdinsight You have no HDInsight clusters Create one to get started Figure 3 7 Create new HDInsight cluster 27 CHAPTER 3 PROVISIONING YOUR HDINSIGHT SERVICE CLUSTER Click on QUICK CREATE to bring up the cluster configuration screen Provide the name of your cluster choose the number of data nodes and select the storage account democluster that was created earlier as the default storage account for your cluster as shown in Figure 3 8 You must also provide a cluster user password The password must be at least 10 characters long and must contain an uppercase letter a lowercase letter a number and a special character CLUSTER NAME COMPUTE 9 SQL DATABASE 5 QUICK CREATE democluster o azurehdinsight net DATA SERVICES STORAGE y CUSTOM CREATE CLUSTER SIZE SUBSCRIPTION APP SERVICES E HDINSIGHT 4 data nodes vi emm NETWORK SERVICES CLUSTER USER NAME ADMIN CONFIRM PASSWORD ei RECOVERY SERVICES a J seveneeceseeee STORAGE ACCOUNT STORE democluster y CREATE HDINSIGHT CLUSTER V Figure 3 8 HDInsight cluster details Note You can select the number of data nodes betwe
122. ace Collects IntelliTrace events only which has minimal effect on performance General Advanced IntelliTrace events and call information IntelliTrace Events Modules gt Performance Tools gt Database Tools FF Tools gt HTML Designer Package Manager SQL Server Tools gt Text Templating Web Performance Test Tools Windows Forms Desianer m Collects call information which can degrade application performance Figure 12 4 Enabling IntelliTrace While you re debugging IntelliTrace collects data about a managed application in the background including information from many framework components such as ADO NET ASP NET and Hadoop NET classes When you break into the debugger you are immediately presented with a sequential list of the IntelliTrace events that were collected In your HadoopClient solution if there is an error for which the cluster creation fails you should see the errors in the sequence of events in the IntelliTrace events window as shown in Figure 12 5 213 CHAPTER 12 TROUBLESHOOTING CLUSTER DEPLOYMENTS All Categories All Threads zs Search D DI Exception Thrown The message filter indicated that the application is busy Exception from HRESULT 0x8001010A RPC_E_SERVERCALL_RETRYLATER System Runtir D Exception Caught The message filter indicated that the application is busy Exception from HRESULT 0x8001010A RPC_E_SERVERCALL_RETRYLATER System Runtir
123. acker DEBUG 10g4j logger org apache hadoop fs FSNamesystem DEBUG log4j logger org apache hadoop metrics2 hadoop metrics log level Set the warning level to WARN to avoid having info messages leak to the console log4j logger org mortbay log WARN 195 CHAPTER 11 LOGGING IN HDINSIGHT The file is commented to make it easier for you to set the logging levels As you can see in the preceding code example you can set the log levels to WARN to stop logging generic INFO messages You can opt to log messages only in the case of debugging for several services like JobTracker and TaskTracker To further shrink the logs you can also set the logging level to ERR to ignore all warnings and worry only in the case of errors There are other properties of interest as well especially those that control the log rollover retention period maximum file size and so on as shown in the following snippet Roll over at midnight log4j appender DRFA DatePattern yyyy MM dd 30 day backup 1og4j appender DRFA MaxBackupIndex 30 log4j appender DRFA layout org apache log4j PatternLayout Default values hadoop tasklog taskid null hadoop tasklog iscleanup false hadoop tasklog noKeepSplits 4 hadoop tasklog totalLogFileSize 100 hadoop tasklog purgeLogSplits true hadoop tasklog logsRetainHours 12 Simple settings like these can really help you control log file growth and avoid certain problems in the future You have limited contr
124. adoop hdfs server namenode FSEditLog Toleration length 0 dfs namenode edits toleration length 2013 08 16 21 32 40 386 INFO org apache hadoop hdfs server namenode FSEditLog Summary Read 4 Corrupt 0 Pad 0 2013 08 16 21 32 41 855 INFO org apache hadoop http HttpServer Port returned by webServer getConnectors 0 getLocalPort before open is 1 Opening the listener on 50070 2013 08 16 21 32 41 855 INFO org apache hadoop http HttpServer listener getLocalPort returned 50070 webServer getConnectors 0 getLocalPort returned 50070 2013 08 16 21 32 41 855 INFO org apache hadoop http HttpServer Jetty bound to port 50070 2013 08 16 21 32 42 527 INFO org apache hadoop hdfs server namenode NameNode Web server up at namenodehost 50070 2013 08 16 21 32 42 558 INFO org apache hadoop ipc Server IPC Server listener on 9000 starting 2013 08 16 21 32 42 574 INFO org apache hadoop ipc Server IPC Server handler 1 on 9000 starting 2013 08 16 21 32 42 574 INFO org apache hadoop ipc Server IPC Server handler 7 on 9000 starting 2013 08 16 21 32 42 574 INFO org apache hadoop ipc Server IPC Server handler 5 on 9000 starting The log gives you important information like the host name the port number on which the web interfaces listen and a lot of other storage related information that could be useful while troubleshooting a problem In the case of an authentication problem with the data nodes you might see error
125. …storage account container, and you should be able to see the files uploaded, as shown in Figure 8-3. (Figure 8-3 residue: a blob listing of TableApple.csv, TableFacebook.csv, TableGoogle.csv, TableIBM.csv, TableMSFT.csv, TableOracle.csv, and Tablefacebook.csv, with their http://democluster.blob.core.windows.net/democlustercontainer/... URLs and last-modified timestamps from 11/23/2013.)

Figure 8-3. The democlustercontainer

Note that the files are uploaded to the root directory. To make it more structured, we will copy the stock data files into the StockData folder. With Remote Desktop, open the Hadoop Command Line and execute the commands shown in Listing 8-3.

Listing 8-3. Copying the data files to the StockData folder

hadoop fs -cp TableApple.csv debarchan/StockData/tableApple.csv
hadoop fs -cp TableFacebook.csv debarchan/StockData/tableFacebook.csv
hadoop fs -cp TableGoogle.csv debarchan/StockData/tableGoogle.csv
hadoop fs -cp TableIBM.csv …
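If you have more files than you care to type commands for, the same copies can be issued in a loop. Here is a minimal PowerShell sketch, assuming the file names and target folder from Listing 8-3 (the paths are illustrative and simply mirror the listing):

# Copy each staged csv into the StockData folder (names mirror Listing 8-3)
foreach ($stock in "Apple", "Facebook", "Google", "IBM", "MSFT", "Oracle") {
    hadoop fs -cp "Table$stock.csv" "debarchan/StockData/table$stock.csv"
}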
126. …storage and computation, which makes sense. But an obvious question that comes up regarding this architecture is this: Will this setup have a bigger network bandwidth cost? The apparent answer seems to be "Yes," because the data in WASB is no longer local to the compute nodes. However, the reality is a little different. Overall, when using WASB instead of HDFS, you should not encounter performance penalties. HDInsight ensures that the Hadoop cluster and the storage account are co-located in the same flat data center network segment. This is the next-generation data center networking architecture, also referred to as the Quantum 10 (Q10) network architecture. Q10 architecture flattens the data center networking topology and provides full bisection bandwidth between compute and storage. Q10 provides a fully nonblocking, 10-Gbps-based, fully meshed network, providing an aggregate backplane in excess of 50 Tbps of bandwidth for each Windows Azure data center. Another major improvement in reliability and throughput is moving from a hardware load balancer to a software load balancer. This entire architecture is based on a research paper by Microsoft, and the details can be found here:

http://research.microsoft.com/pubs/80693/vl2-sigcomm09-final.pdf

In the year 2012, Microsoft deployed this flat network for Windows Azure across all of the data centers to create Flat Network Storage (FNS). The r…
127. …contains multiple implementations of HadoopJob<>, you need to indicate the one you wish to run:

MRRunner -dll MyDll -class MyClass

To supply additional configuration options to your job, you need to pass them as trailing arguments on the command line, after a double hyphen:

MRRunner -dll MyDll -class MyClass -- extraArg1 extraArg2

These additional arguments are provided to your job via a context object that is available to all methods on HadoopJob<>. When you develop a project using the .NET SDK, the MRRunner utility will be automatically deployed in a folder called MRLib in your project directory, as illustrated in Figure 5-7. It is basically a Windows executable (.exe) file. (Figure 5-7 residue: the MRLib folder listing, containing HiveDriver.exe, Microsoft.Hadoop.Client.dll, Microsoft.Hadoop.CombineDriver.exe, Microsoft.Hadoop.MapDriver.exe, Microsoft.Hadoop.MapReduce.dll, Microsoft.Hadoop.ReduceDriver.exe, Microsoft.WindowsAzure.Management….dll, and MRRunner.exe.)
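To illustrate how those trailing arguments surface inside a job, here is a minimal C# sketch. The class and mapper names are illustrative, and the sketch assumes the .NET SDK's HadoopJob<>, ExecutorContext, and MapperBase types behave as described in this chapter; it is not the book's own sample:

using System.Linq;
using Microsoft.Hadoop.MapReduce;

// MRRunner locates this HadoopJob<> implementation inside the DLL,
// calls Configure(), and surfaces everything after the double hyphen
// through context.Arguments.
public class MyClass : HadoopJob<MyMapper>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        var args = context.Arguments.ToArray(); // extraArg1, extraArg2, ...
        var config = new HadoopJobConfiguration();
        config.InputPath = args[0];     // first trailing argument
        config.OutputFolder = args[1];  // second trailing argument
        return config;
    }
}

public class MyMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        // A trivial map: emit the length of each input line
        context.EmitKeyValue("length", inputLine.Length.ToString());
    }
}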
128. …al he sought: he wanted to do a full book without a single screen shot. He promises his next book will be fiction or a collection of poetry, but that has yet to transpire.

Scott Klein is a Microsoft Data Platform Technical Evangelist who lives and breathes data. His passion for data technologies brought him to Microsoft in 2011, which has allowed him to travel all over the globe evangelizing SQL Server and Microsoft's cloud data services. Prior to Microsoft, Scott was one of the first four SQL Azure MVPs, and even though those don't exist anymore, he still claims it. Scott has authored several books that talk about SQL Server and Windows Azure SQL Database, and he continues to look for ways to help people and companies grok the benefits of cloud computing. He also thinks "grok" is an awesome word. In his spare time, what little he has, Scott enjoys spending time with his family and trying to learn German, and he has decided to learn how to brew root beer without using the extract. He recently learned that data scientists are sexy, so he may have to add that skill to his toolbelt.

Acknowledgments

This book benefited from a large and wide variety of people, ideas, input, and efforts. I'd like to acknowledge several of them, and I apologize in advance to those I may have forgotten; I hope you guys will understand. My heartfelt and biggest THANKS, perhaps, is to Andy Leonard (AndyLeonard) for his help on this book project. Without Andy, th…
129. (Figure 12-5 residue: the IntelliTrace events window lists repeated HadoopClient.Program.AnonymousMethod entries over System.Security.Cryptography.X509Certificates.X509Certificate2 items, then a System.Linq.Enumerable.First(System.Collections.Generic.IEnumerable source, System.Func predicate) call, then "Debugger: Exception Intercepted, CreateCluster, Program.cs line 51" in the Microsoft.WindowsAzure.Management.HDInsight.ClusterProvisioning.Data.HDInsightClusterCreationDetails constructor, followed by "Debugger: Step Recorded" entries at Program.cs lines 56 and 57 for HDInsightClusterCreationDetails.set_Name and the subsequent property setters, Mi…)
130. …amework\v4.0_4.0.0.0__31bf3856ad364e35\ReachFramework.dll
…
Job: job_201309210954_0193 completed.

The MRRunner command can be put in a Windows batch file (.bat) or a command file (.cmd) and scheduled in Windows Task Scheduler to execute on a periodic basis (a sketch of this approach appears at the end of this section). Of course, there are plenty of other ways as well to automate MRRunner operations.

Summary

One of the major benefits of using the Azure HDInsight service is the elasticity it provides in terms of spinning up clusters and running jobs exactly when they are required. The basic idea behind this is to avoid preserving idle clusters just for storage. In HDInsight, the ultimate goal will be to present a script or a program that demonstrates how you can provide a DLL and have the script bring a cluster online, run your job, and then remove the cluster, while allowing you to specify the cluster name and the number of hosts needed to run the job. There are various ways you can provision a new cluster, the simplest of them being the Management portal with its easy-to-use, intuitive graphical user interface. But as requirements become more and more complex and unpredictable, along with project budget limitations, automating and parameterizing cluster provisioning and job submissions become a necessity. You can also provision a cluster and configure it to connect to more than one Azure Blob storage account, or to custom Hive and Oozie metastores. This advanced feature allows you to separate lifet…
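As promised above, here is a sketch of the Task Scheduler approach. The file names, paths, job DLL, and schedule are all illustrative assumptions, not values from the book:

rem RunDailyJob.cmd: wraps the MRRunner call so Task Scheduler can run it
cd /d C:\HadoopClient\MRLib
MRRunner -dll C:\HadoopClient\bin\Debug\MyJobs.dll -class MyClass -- extraArg1

rem Register the wrapper with Task Scheduler to run daily at 2:00 AM:
schtasks /Create /SC DAILY /ST 02:00 /TN "HDInsightDailyJob" /TR "C:\HadoopClient\RunDailyJob.cmd"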
131. …and analytical applications faster, and share and collaborate on insights more easily, using the familiar environments of Excel and SharePoint. PowerPivot comes as an add-in to Excel 2013 and Excel 2010 that allows business users to work with data from any source and syndication, including Open Data Protocol (OData) feeds, to create business models and integrate large amounts of data directly into Excel workbooks. Sophisticated workbooks can be built using Excel only, or using the PowerPivot model as a source of data for other BI tools. These BI tools can include third-party tools, as well as the new Power View capability, discussed later in this chapter, to generate intelligent and interactive reports. These reports can be published to SharePoint Server and then shared easily across an organization.

The following section explains how to generate a PowerPivot data model based on the stock_analysis Hive table created earlier, using the Microsoft Hive ODBC Driver. We have used Excel 2013 for the demos. Open a new Excel worksheet, and make sure you turn on the required add-ins for Excel, as shown in Figure 9-1. You'll need those add-ins enabled to build the samples used throughout this chapter. Go to File > Options > Add-ins. In the Manage drop-down list, click COM Add-ins > Go, and enable the add-ins. (Figure 9-1 residue: the COM Add-ins dialog, listing available add-ins such as Inquire and the Load Test Report Addi…)
132. …commands using the command line, and about the different unique Windows services for Hadoop. You also had a look at the different supporting projects, like Hive, Sqoop, and Pig, and how they can be invoked from the command line as well as from PowerShell scripts. Finally, we navigated through the installation files and folder hierarchies of Hadoop and the other projects in the C:\apps\dist directory of the name node.

CHAPTER 7
Using Windows Azure HDInsight Emulator

Deploying your Hadoop clusters on Azure invariably incurs some cost. The actual cost of deploying a solution depends on the size of your cluster, the data you play with, and certain other aspects, but there are some bare-minimum expenses for even setting up a test deployment for evaluation. For example, you will at least need to pay for your Azure subscription in order to try the HDInsight service on Azure. This is not acceptable for many individuals or institutions who want to evaluate the technology and then decide on an actual implementation. Also, you need to have a test bed to test your solutions before deploying them to an actual production environment. To address these scenarios, Microsoft offers the Windows Azure HDInsight Emulator.

The Windows Azure HDInsight Emulator is an implementation of HDInsight on the Windows Server family. The emulator is currently available as a Developer Preview, where the Hadoop-based services on Windows use only a single-node deployment. HDInsi…
133. …MapReduce 2.0, also known as MRv2 or YARN, is a subproject of Hadoop at the Apache Software Foundation that was introduced in Hadoop 2.0. It separates the resource-management and processing components, and it provides a more generalized processing platform that is not restricted to just MapReduce.

Configuration Files

There are two key configuration files that hold the various parameters for MapReduce jobs. These files are located in the path C:\apps\dist\hadoop-1.2.0.1.3.1.0-06\conf of the NameNode:

- core-site.xml
- mapred-site.xml

core-site.xml

This file contains configuration settings for Hadoop Core, such as I/O settings that are common to Windows Azure Storage Blob (WASB) and MapReduce. It is used by all Hadoop services and clients, because all services need to know how to locate the NameNode. There will be a copy of this file in each node running a Hadoop service. This file has several key elements of interest, particularly because the storage infrastructure has moved to WASB instead of being in the Hadoop Distributed File System (HDFS), which used to be local to the data nodes. For example, in your democluster, you should see entries in your core-site.xml file similar to Listing 13-1.

Listing 13-1. WASB detail

<property>
  <name>fs.default.name</name>
  <!-- cluster variant -->
  <value>wasb://democlustercontainer@democluster.blob.core.windows.net</value>
  <description>The name of the default fil…
134. …mapred.JobClient: SPLIT_RAW_BYTES=87
13/12/10 01:05:45 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 54.0554 seconds (0 bytes/sec)
13/12/10 01:05:45 INFO mapreduce.ImportJobBase: Retrieved 36153 records.

Windows PowerShell also provides cmdlets to execute Sqoop jobs. The PowerShell script in Listing 6-10 exports the same StockAnalysis blob from WASB to a SQL Azure database called ExportedData.

Listing 6-10. The Sqoop export PowerShell script

$subscriptionName = "Your Subscription Name"
$clusterName = "democluster"
$sqoopCommand = "export --connect jdbc:sqlserver://<Server>.database.windows.net;username=debarchans@<Server>;password=<Password>;database=sqoopdemo --table stock_analysis --export-dir /user/hadoopuser/example/data/StockAnalysis --input-fields-terminated-by ,"
$sqoop = New-AzureHDInsightSqoopJobDefinition -Command $sqoopCommand
$sqoopJob = Start-AzureHDInsightJob -Subscription (Get-AzureSubscription -Current).SubscriptionId -Cluster $clusterName -JobDefinition $sqoop
Wait-AzureHDInsightJob -Subscription (Get-AzureSubscription -Current).SubscriptionId -Job $sqoopJob -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $sqoopJob.JobId -StandardError -Subscription $subscriptionName

Successful execution of the Sqoop export job shows output similar to Listing 6-11.

Listing 6-11. The PowerShell Sqoop export output

StatusDirectory : ee84101c-98ac-4a2b-ae3d-49600eb59…
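The log excerpt at the top of this fragment comes from the matching import direction. For reference, here is a minimal sketch of such an import command built from the same connection string; the server, credentials, and target directory are placeholders, and the flags are standard Sqoop options rather than a listing from the book:

import --connect "jdbc:sqlserver://<Server>.database.windows.net;username=debarchans@<Server>;password=<Password>;database=sqoopdemo" --table stock_analysis --target-dir /user/hadoopuser/example/data/StockAnalysis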
135. …arams (Solution Explorer residue: the HiveConsumer project tree showing the project parameters node, Connection Managers, SSIS Packages containing Package.dtsx, and Miscellaneous.)

Figure 10-3. Package.dtsx

An SSIS solution is a placeholder for a meaningful grouping of different SSIS workflows. It can have multiple projects (in this solution, you have only one, HiveConsumer), and each project in turn can have multiple SSIS packages (in this project, you have only one, Package.dtsx) implementing specific data-load jobs.

Creating the Data Flow

As discussed earlier, a data flow is an SSIS package component used for moving data across different sources and destinations. In this package, to move data from Hive to SQL Server, you first need to create a data flow task in the package that contains the source and destination components to transfer the data. Double-click the Package.dtsx created above in the SSDT solution to open the designer view. To create a data flow task, double-click or drag and drop a data flow task from the toolbox on the left side of the pane. This places a data flow task in the Control Flow canvas of the package, as shown in Figure 10-4. (Figure 10-4 residue: the SSIS toolbox, with Favorites entries such as Data Flow Task and Execute SQL Task, and Common entries such as Analysis Services Pro…, Bulk Insert Task, Data Profiling Task, Execute Package Task, Execute Process Task, Expr…)
136. …scenarios, after the map jobs are over, most of the nodes go idle, with only a few nodes working for the reduce jobs to complete. To make reduce jobs finish fast, you can increase the number of reducers to match the number of nodes or the total number of processor cores. Following is the SET command you use to configure the number of reducers launched from a Hive job:

set mapred.reduce.tasks=<number>;

Implement Map Joins

Map joins in Hive are particularly useful when a single huge table needs to be joined with a very small table. The small table can be placed into memory in a distributed cache by using map joins. By doing that, you avoid a good deal of disk I/O. The SET commands in Listing 13-13 enable Hive to perform map joins and cache the small table in memory.

Listing 13-13. Hive SET options

set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=40000000;

Another important configuration is the hive.mapjoin.smalltable.filesize setting. By default, it is 25 MB, and if the smaller table exceeds this size, all of your original MapJoin tests revert back to common joins. In the preceding snippet, I have overridden the default setting and set it to 40 MB. A short example query appears at the end of this fragment.

Note: There are no reducers in map joins, because such a join can be completed during the map phase with a lot less data movement. You can confirm that map joins are happening if you see the following:

- With a map join, there are no reducers, because the join happen…
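As promised above, here is a minimal HiveQL sketch of a query that benefits from these settings. The stock_analysis table is the one used throughout this book; the small stock_symbols lookup table and its columns are illustrative assumptions:

-- Cache the small lookup table in memory so the join completes map-side
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=40000000;

-- No shuffle and no reducers: each mapper joins its split of the large
-- table against the in-memory copy of the small table.
SELECT a.stock_symbol, s.company_name, a.stock_price_close
FROM stock_analysis a
JOIN stock_symbols s ON (a.stock_symbol = s.stock_symbol);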
137. The Apache Hadoop metastore service is starting.
The Apache Hadoop metastore service was started successfully.
Wait 10s for metastore db setup...
Starting hiveserver...
The Apache Hadoop hiveserver service is starting.
The Apache Hadoop hiveserver service was started successfully.
Starting hiveserver2...
The Apache Hadoop hiveserver2 service is starting.
The Apache Hadoop hiveserver2 service was started successfully.
Starting Oozie service...
Starting oozieservice...
Waiting for service to start...
Oozie Service started successfully.
The Apache Hadoop templeton service is starting.
The Apache Hadoop templeton service was started successfully.

Note: Any service startup failures will also be displayed in the console. You may need to navigate to the respective log files to investigate further.

Using the Emulator

Working with the emulator is no different from using the Azure service, except for a few nominal changes. Specifically, if you modified the core-site.xml file to point to your Windows Azure Blob Storage, there are very minimal changes to your Hadoop commands and MapReduce function calls. You can always use the Hadoop Command Line to execute your MapReduce jobs. For example, to list the contents of your storage blob container, you can fire the ls command, as shown in Listing 7-3.

Listing 7-3. Executing the ls command

hadoop fs -ls wasb://democlustercontainer@democluster.blob.core.windows.net/
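The same fully qualified wasb:// addressing works with the other file system commands too. For instance, here is a sketch (assuming the same storage account and container, and a local file of your choosing) of staging a local file straight into the blob container from the emulator:

hadoop fs -copyFromLocal C:\data\TableApple.csv wasb://democlustercontainer@democluster.blob.core.windows.net/TableApple.csv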
138. (PowerPivot window residue: the ribbon with Paste, Get External Data, Refresh, PivotTable, formatting, and sort-and-filter controls, plus the stock_analysis column list: stock_price_open, stock_price_high, stock_price_low, stock_price_close, stock_volume, stock_price_adj_close, exchange, and Average of stock_price….)

Figure 9-15. Creating the relation

(Residue: the DimDate column list: DayNumberOfWeek, EnglishDayNameOfWeek, SpanishDayNameOfWeek, FrenchDayNameOfWeek, DayNumberOfMonth, DayNumberOfYear, WeekNumberOfYear, EnglishMonthName, SpanishMonthName, FrenchMonthName, MonthNumberOfYear, CalendarQuarter, along with the AutoSum, Create KPI, and Calculation Area controls.)

Click on the Create Hierarchy button in the DimDate table. Create a hierarchy for CalendarYear, CalendarQuarter, EnglishMonthName, and FullDateAlternateKey, as shown in Figure 9-16. Drag and drop these four columns under the hierarchy's HDate value. (Figure 9-16 residue: the diagram view showing the stock_analysis and DimDate tables with their columns, including stock_symbol, stock_date, DateKey, and FullDateAlternateKey.)
139. …at it can be used to augment the results of analysis and reporting processes. Following are some examples:

- Social data, log files, sensors, and applications that generate data files
- Datasets obtained from Windows Data Market and other commercial data providers
- Streaming data filtered or processed through SQL Server StreamInsight

Note: Microsoft StreamInsight is a Complex Event Processing (CEP) engine. The engine uses custom-generated events as its source of data and processes them in real time, based on custom query logic (standing queries) and events. The events are defined by a developer/user and can be simple or quite complex, depending on the needs of the business.

You can use the following techniques to integrate output from HDInsight with enterprise BI data at the report level. These techniques are revisited in detail throughout the rest of this book.

- Download the output files generated by HDInsight and open them in Excel, or import them into a database for reporting.
- Create Hive tables in HDInsight and consume them directly from Excel, including using PowerPivot, or from SQL Server Reporting Services (SSRS), by using the Simba ODBC driver for Hive.
- Use Sqoop to transfer the results from HDInsight into a relational database for reporting. For example, copy the output generated by HDInsight to a Windows Azure SQL Database table and use Windows Azure SQL Reporting Services to crea…
140. …ate corrective actions.

Compress Job Output

Hadoop is intended for storing large data volumes, so compression becomes a mandatory requirement. You can choose to compress your MapReduce job output by adding the following two parameters in your mapred-site.xml file:

mapred.output.compress = true
mapred.output.compression.codec = org.apache.hadoop.io.compress.GzipCodec

Apart from these parameters, MapReduce provides facilities for the application developer to specify compression for both intermediate map outputs and the job outputs (that is, the output of the reducers). Such compression can be set up with a CompressionCodec class implementation for the zlib compression algorithm in your custom MapReduce program. For extensive details on Hadoop compression, see the whitepaper at http://msdn.microsoft.com/en-us/dn168917.aspx.

Concatenate Input Files

Concatenation is another technique that can improve your MapReduce job performance. The MapReduce program is designed to handle a few larger files well, in comparison to several smaller files. Thus, you can concatenate many small files into a few larger ones. This needs to be done in the program code where you implement your own MapReduce job. MapReduce can concatenate multiple small files to make up one block size, which is more efficient in terms of storage and data movement.

Avoid Spilling

All data in a Hadoop MapReduce job is handled as key-value pairs. All input…
141. …available for download. You should download the appropriate version of the driver for your operating system and the application that will consume the driver, and be sure to match the bitness. For example, if you want to consume the driver from the 32-bit version of Excel, you will need to install the 32-bit Hive ODBC driver. This chapter shows you how to create a basic schema structure in Hive, load data into that schema, and access the data using the ODBC driver from a client application.

Hive: The Hadoop Data Warehouse

Hive is a framework that sits on top of core Hadoop. It acts as a data warehousing system on top of HDFS and provides easy query mechanisms to the underlying HDFS data. By revisiting the Hadoop Ecosystem diagram in Chapter 1, you can see that Hive sits right on top of Hadoop core, as shown in Figure 8-1. (Figure 8-1 residue: the ecosystem diagram, with layers for Business Intelligence (Excel, Power View), the Data Access Layer (ODBC, Sqoop, REST), Metadata (HCatalog), Stats Processing (RHadoop), Graph (Pegasus), Machine Learning (Mahout), Log File Aggregation, Scripting (Pig), Query (Hive), the NoSQL Database (HBase), Distributed Processing (MapReduce), and Distributed Storage (HDFS).)

Figure 8-1. The Hadoop ecosystem

Programming MapReduce jobs can be tedious, and such jobs require their own development, testing, and maintenance investments. Hive lets you democratize access to Big Data u…
142. (Figure 8-7 residue: the Microsoft Hive ODBC Driver DSN Setup dialog, with Host set to <cluster>.azurehdinsight.net, Port 443, Database default, Hive Server Type, Authentication Mechanism set to Windows Azure HDInsight Service, fields for Realm, Host FQDN, Service Name, and HTTP Path, and User Name admin.)

Figure 8-7. Finishing the configuration

Click on the Test button to make sure that a connection can be established successfully, as shown in Figure 8-8. (Figure 8-8 residue: the Microsoft Hive ODBC Driver Data Source Test window reporting "Running connectivity tests... Attempting connection. Connection established. Disconnecting from server. TESTS COMPLETED SUCCESSFULLY.")

Figure 8-8. Testing a connection

There are a few settings of interest on the Advanced Options page of the DSN Setup screen. The most important one is the Default string column length value. By default, this will be set to 65536, which is larger than the maximum string length of many client applications (for example, SSIS), and which may have negative performance implications. If you know that your data values will be less than the maximum characters in length supported by your client application, I recommend lowering this value to 4000 or less. The other options you can control throug…
143. …ble and perform massive parallel processing, so you can test your Big Data solution on the emulator. Once you are satisfied, you can deploy your actual solution to production in Azure and take advantage of multinode Hadoop clusters on Windows Azure. For on-premises use, Microsoft is offering its Parallel Data Warehouse (PDW) technology, which is an appliance-based, multinode HDInsight cluster, while the emulator will continue to be single node and serve as a test bed.

CHAPTER 8
Accessing HDInsight over Hive and ODBC

If you are a SQL developer and want to cross-pollinate your existing SQL skills in the world of Big Data, Hive is probably the best place for you. This section of the book will enable you to be the Queen Bee of your Hadoop world with Hive and gain business intelligence (BI) insights with Hive Query Language (HQL) filters and joins of Hadoop Distributed File System (HDFS) datasets. Hive provides a schema to the underlying HDFS data and a SQL-like query language to access that data. Simba, in collaboration with Microsoft, provides an ODBC driver that is the supported and recommended interface for connecting to HDInsight. It can enable client applications to connect and consume Hive data that resides on top of your HDFS (WASB, in the case of HDInsight). The driver is available for a free download at http://www.microsoft.com/en-us/download/details.aspx?id=40886. The preceding link has both the 32-bit and 64-bit Hive ODBC drivers…
144. …thumbprint for your development system and bind it with your Azure subscription. So, the next task is to use Windows Azure PowerShell to bind your Azure subscription details to your development machine. You can install Azure PowerShell using the Web Platform Installer from the following link:

http://go.microsoft.com/fwlink/?linkid=320376&clcid=0x409

Accept the license agreement, and you should see the installation screen for Azure PowerShell, as shown in Figure 4-5. (Figure 4-5 residue: the Web Platform Installer's Prerequisites, Install, and Finish steps for Windows Azure PowerShell, with download progress shown for the Windows Azure Emulator.)

Figure 4-5. Web Platform Installer

Once the installation is complete, open the Windows Azure PowerShell console and execute the following command:

Get-AzurePublishSettingsFile

When prompted, download and save the publishing profile, and note the path and name of the .publishsettings file. Then execute the following command to import the subscription, with the proper path to the .publishsettings file:

Import-AzurePublishSettingsFile C:\Users\<…
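Once the profile is imported, it is worth confirming the binding before running any HDInsight cmdlets. A short sketch (the subscription name is a placeholder):

# List the subscriptions now bound to this machine
Get-AzureSubscription

# Make one of them current for subsequent cmdlets
Select-AzureSubscription -SubscriptionName "Your Subscription Name"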
145. …bular database engines, a brokering service, a scheduling service (SQL Agent), and many other features. As discussed in Chapter 1, it has become extremely important these days to integrate data between different sources. The advantage that SQL Server brings is that it offers a powerful Business Intelligence (BI) stack, which provides rich features for data mining and interactive reporting. One of these BI components is an Extract, Transform, and Load (ETL) tool called SQL Server Integration Services (SSIS). ETL is a process to extract data, mostly from different types of systems, transform it into a structure that's more appropriate for reporting and analysis, and finally load it into the database. SSIS, as an ETL tool, offers the ability to merge structured and unstructured data by importing Hive data into SQL Server, and then to apply powerful analytics on the integrated data. Throughout the rest of this chapter, you will get a basic lesson on how SSIS works and create a simple SSIS package to import data from Hive to SQL Server.

SSIS as an ETL Tool

The primary objective of an ETL tool is to be able to import and export data to and from heterogeneous data sources. This includes the ability to connect to external systems, as well as to transform or clean the data while moving it between the external systems and the databases. SSIS can be used to import data to and from SQL Server. It can even be used to move data between external non-SQL systems without requir…
146. … 40
Connecting to Your Subscription … 42
Coding the Application … 44
Using the PowerShell cmdlets for HDInsight … 51
Command-Line Interface … 55
Summary … 58

Chapter 5: Submitting Jobs to Your HDInsight Cluster … 59
Using the Hadoop .NET SDK … 59
Adding the … 60
Submitting a Custom MapReduce Job … 60
Submitting the wordcount MapReduce Job … 69
Submitting a Hive Job … 71
Monitoring Job Status … 74
Using PowerShell … 80
… 80
Executing the Job … 83
Using MRRunner … 85
Summary … 87

Chapter 6: Exploring the HDInsight Name Node … 89
Accessing the HDInsight Name Node…
147. …MapReduce job to return its results, because there is no need to eliminate columns; hence, there is less processing in the background.

Listing 8-10. Selecting all columns

SELECT * FROM stock_analysis;

Note: In cases where selecting only a few columns reduces a lot of the data to transfer, it may still be interesting to select only a few columns.

In addition to common SQL semantics, HiveQL supports the inclusion of custom MapReduce scripts embedded in a query through the MAP and REDUCE clauses, as well as custom User Defined Functions (UDFs) implemented in Java. This extensibility enables you to use HiveQL to perform complex transforms on data as it is queried. For a complete reference on Hive data types and HQL, see the Apache Hive language manual site: https://cwiki.apache.org/confluence/display/Hive/Home.

Hive Storage

Hive stores all its metadata in its own storage, called the Hive MetaStore. Traditional Hive uses its native Derby database by default, but Hive can also be configured to use MySQL as its MetaStore. With HDInsight, this capability extends, and the Hive MetaStore can be configured to be SQL Server, as well as SQL Azure. You can modify the Hive configuration file, hive-site.xml, found under the conf folder in the Hive installation directory, to customize your MetaStore. You can also customize the Hive MetaStore while deploying your HDInsight cluster through the CUSTOM CREATE wizard.
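As a sketch of what pointing the MetaStore at an external database looks like in hive-site.xml, using Hive's standard JDBC metastore properties (the server, database, and credentials here are placeholders, not values from the book):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:sqlserver://myserver.database.windows.net;database=hivemetastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.microsoft.sqlserver.jdbc.SQLServerDriver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>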
148. …che.hadoop.hdfs.server.namenode.NameNode  DateTime=2013-12-10T02:46:57.6211250Z  Timestamp=3981611043
HadoopServiceTraceSource Information: 0 : ServiceHost OnStart  DateTime=2013-12-10T02:46:57.6211250Z  Timestamp=3981662789
HadoopServiceTraceSource Information: 0 : Child process started, PID: 3720  DateTime=2013-12-10T02:46:57.6211250Z  Timestamp=3981707399

These logs record very low-level service startup messages. Most likely, the information in them is external to the Hadoop system. For example, in a network failure scenario, you might see an entry similar to the following in your namenode trace log file:

Session Terminated. Killing shell.

It is very rare that these log files get populated with anything else apart from the service startup messages. For example, they might be populated in the case of a network heartbeat failure between the name node and the data nodes. Still, they can be helpful at times in figuring out why your DataNode, NameNode, or Secondary NameNode service isn't starting up or is sporadically shutting down.

Note: These trace .log files are introduced with HDInsight cluster version 2.1. In version 1.6 clusters, the file names are out.log. The following two sections are specific to HDInsight clusters in version 1.6. The log file types discussed are not available if the cluster version is 2.1. This holds good for the Windows Azure HDInsight Emulator since, as of this…
149. …including Windows Server, System Center, Linux, and others. It supports heterogeneous languages, including .NET, Java, Node.js, and Python, and data services for NoSQL, SQL, and Hadoop. So, if you need to tap into the power of Big Data, simply pair Azure web sites with HDInsight to mine data of any size, and use compelling business analytics to make adjustments and get the best possible business results.

A Windows Azure subscription grants you access to Windows Azure services and to the Windows Azure Management Portal (https://manage.windowsazure.com). The terms of the Windows Azure account, which is acquired through the Windows Azure Account Portal, determine the scope of activities you can perform in the Management Portal and describe limits on available storage, network, and compute resources. A Windows Azure subscription has two aspects:

- The Windows Azure storage account, through which resource usage is reported and services are billed. Each account is identified by a Windows Live ID or corporate e-mail account and associated with at least one subscription. The account owner monitors usage and manages billings through the Windows Azure Account Center.
- The subscription itself, which controls the access and use of Windows Azure subscribed services by the subscription holder from the Management Portal.

Figure 2-1 shows you the Windows Azure Management Portal, which is your dashboard to manage all yo…
150. (Figure 12-5 residue, continued: "Debugger: Step Recorded" entries in CreateCluster at Program.cs lines 57 through 63 for the Microsoft.WindowsAzure.Management.HDInsight.ClusterProvisioning.Data.HDInsightClusterCreationDetails property setters set_Location, set_DefaultStorageAccountName, set_DefaultStorageAccountKey, set_DefaultStorageContainer, set_UserName, set_Password, and set_ClusterSizeInNodes, then a step at line 64 into ClusterProvisioningClient.CreateCluster, followed by "Exception Thrown: Object reference not set to an in…")
151. …d:

IList<ClusterInfo> clusterInfos = client.GetClusters();
ClusterInfo clusterInfo = clusterInfos[0];
Console.WriteLine("Cluster Href: {0}", clusterInfo.Href);
Regex clusterNameRegEx = new Regex(@"(\w+)");
var clusterName = clusterNameRegEx.Match(Constants.azureClusterUri.Authority).Groups[1].Value;
HostComponentMetric hostComponentMetric = client.GetHostComponentMetric(clusterName + ".azurehdinsight.net");
Console.WriteLine("Cluster Map Reduce Metrics:");
Console.WriteLine("\tMaps Completed:\t{0}", hostComponentMetric.MapsCompleted);
Console.WriteLine("\tMaps Failed:\t{0}", hostComponentMetric.MapsFailed);
Console.WriteLine("\tMaps Killed:\t{0}", hostComponentMetric.MapsKilled);
Console.WriteLine("\tMaps Launched:\t{0}", hostComponentMetric.MapsLaunched);
Console.WriteLine("\tMaps Running:\t{0}", hostComponentMetric.MapsRunning);
Console.WriteLine("\tMaps Waiting:\t{0}", hostComponentMetric.MapsWaiting);

When you execute the MonitorCluster method, you should see output similar to the following:

Cluster Href: https://democluster.azurehdinsight.net/ambari/api/monitoring/v1/clusters/democluster.azurehdinsight.net
Cluster Map Reduce Metrics:
        Maps Completed: 151
        Maps Failed:    20
        Maps Killed:    0
        Maps Launched:  171
        Maps Running:   0
        Maps Waiting:   10

The Ambari APIs can be used, as mentioned, to display MapRe…
152. …d. (Figure 10-10 residue: the Connection Manager dialog, with "Connect to a database / Select or enter a database name" set to HiveDemo, and "Test connection succeeded." reported after clicking Test Connection.)

Figure 10-10. Testing the SQL connection

Note: In this example, I chose OLE DB to connect to SQL. You can also choose to use ADO.NET or an ODBC connection to do the same. Also, a SQL database, HiveDemo, was pre-created using SQL Server Management Studio.

Creating the Hive Source Component

Next, you need to configure a source component that will connect to Hive and fetch the data. After the connection is successfully created, double-click to place an ADO.NET source on the Data Flow canvas, as shown in Figure 10-11. (Figure 10-11 residue: the Other Sources section of the SSIS toolbox, listing CDC Source, Excel Source, Flat File Source, ODBC Source, and OLE DB Source.)

Figure 10-11. Creating the ADO.NET source

Note: ODBC Source and ODBC Destination are a pair of data flow components that are included with SSIS 2012. The lack of direct SSIS ODBC components was always a complaint from customers regarding the product; hence, Microsoft partnered with Attunity to make these components available as a part of the product. Though the ODBC Source component supports many ODBC-compliant data sources, it does not currently support the Hive ODBC driver. Today, the only option to consume the Hive ODBC driver from SSIS is via the ADO.NET components.

Right-click the ADO.NET source and select E…
153. …subscriptionId = "Your Subscription Id";
public static string clusterUser = "admin";
public static string hadoopUser = "hdp";
public static string clusterPassword = "Your Password";
public static string storageAccount = "democluster";
public static string storageAccountKey = "Your_Storage_Key";
public static string container = "democlustercontainer";
public static string wasbPath = "wasb://democlustercontainer@democluster.blob.core.windows.net";

Note: Connection to the HDInsight cluster defaults to the standard Secure Sockets Layer (SSL) port 443. However, if you have a cluster prior to version 2.1, the connection is made through port 563.

The constant hadoopUser is the user account that runs the Hadoop services on the NameNode. By default, this user is hdp in an HDInsight distribution. You can always connect remotely to the NameNode and find this service account in the Windows Services console, as shown in Figure 5-1. (Figure 5-1 residue: the Services console listing the Apache Hadoop derbyserver, hiveserver, hiveserver2, isotopejs, jobtracker, metastore, namenode, and oozieservice services, started with Manual startup type, most running as the hdp account.)

Figure 5-1. Hadoop service account

You will use these class…
154. …d multinode Hadoop solutions on their on-premises Windows servers today, the recommended option is to use HDP for Windows. Microsoft has no plans whatsoever to make this emulator multinode and give it the shape of a production, on-premises Hadoop cluster on Windows.

Installing the Emulator

The Windows Azure HDInsight Emulator is installed with the Microsoft Web Platform Installer, version 4.5 or higher. The current distribution of the HDInsight Emulator installs HDP 1.1 for Windows. For more details about the different HDP versions, visit the Hortonworks web site: http://hortonworks.com/products/hdp.

Note: The Microsoft Web Platform Installer (Web PI) is a free tool that makes getting the latest components of the Microsoft Web Platform, including Internet Information Services (IIS), SQL Server Express, .NET Framework, and Visual Web Developer, easy. The Web PI also makes it easy to install and run the most popular free web applications for blogging, content management, and more, with the built-in Windows Web Application Gallery.

HDP 1.1 includes HDInsight cluster version 1.6. Microsoft plans to upgrade the emulator to match the version that is deployed in the Azure service, which as of now is version 2.1. The emulator currently supports Windows 7, Windows 8, and the Windows Server 2012 family of operating systems. It can be downloaded from the following link: http://www.microsoft.com/w…
155. …d=releaseLocks>
2013-11-15 13:37:11,514 INFO ql.Driver (PerfLogger.java:PerfLogEnd(127)) - </PERFLOG method=releaseLocks start=1384522631512 end=1384522631513 duration=1>

Much the same way, if you try to drop a database that does not even exist, you will see errors logged like those in Listing 13-10.

Listing 13-10. hive.log file showing some errors

2013-11-15 14:25:31,810 INFO ql.Driver (PerfLogger.java:PerfLogBegin(100)) - <PERFLOG method=Driver.run>
2013-11-15 14:25:31,811 INFO ql.Driver (PerfLogger.java:PerfLogBegin(100)) - <PERFLOG method=TimeToSubmit>
2013-11-15 14:25:31,811 INFO ql.Driver (PerfLogger.java:PerfLogBegin(100)) - <PERFLOG method=compile>
2013-11-15 14:25:31,812 INFO parse.ParseDriver (ParseDriver.java:parse(179)) - Parsing command: drop database hive
2013-11-15 14:25:31,813 INFO parse.ParseDriver (ParseDriver.java:parse(197)) - Parse Completed
2013-11-15 14:25:31,814 INFO ql.Driver (Driver.java:compile(442)) - Semantic Analysis Completed
2013-11-15 14:25:31,815 INFO ql.Driver (Driver.java:getSchema(259)) - Returning Hive schema: Schema(fieldSchemas:null, properties:null)
2013-11-15 14:25:31,816 INFO ql.Driver (PerfLogger.java:PerfLogEnd(127)) - <PERFLOG method=compile start=1384525531811 end=1384525531816 duration=5>
2013-11-15 14:25:31,816 INFO ql.Driver (PerfLogger.java:PerfLogBegin(100)) - <PERFLOG method=D…
156. …data received by the user-defined method that constitutes the reduce task is guaranteed to be sorted by key. This sorting happens in two parts. The first sorting happens local to each mapper, as the mapper reads the input data from one or more splits and produces the output from the mapping phase. The second sorting happens after a reducer has collected all the data from one or more mappers and then produces the output from the shuffle phase.

The process of spilling during the map phase is the phenomenon in which the complete input to the mapper cannot be held in memory before the final sorting can be performed on the output from the mapper. As each mapper reads input data from one or more splits, the mapper requires an in-memory buffer to hold the unsorted data as key-value pairs. If the Hadoop job configuration is not optimized for the type and size of the input data, the buffer can get filled up before the mapper has finished reading its data. In that case, the mapper will sort the data already in the filled buffer, partition that data, serialize it, and write (spill) it to the disk. The result is referred to as a spill file. Separate spill files are created each time a mapper has to spill data. Once all the data has been read and spilled, the mapper will read all the spilled files again, sort and merge the data, and write (spill) that data back into a single file, known as an attempt file. If there is more than one spill, there must be one extra read and write…
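The buffer behavior described above is governed by a pair of standard Hadoop 1.x job settings. Here is a sketch of raising them in mapred-site.xml; the values shown are illustrative, so size the buffer to your map output volume and available task memory:

<property>
  <!-- Size, in MB, of the in-memory buffer that holds map output before sorting -->
  <name>io.sort.mb</name>
  <value>200</value>
</property>
<property>
  <!-- Fraction of the buffer that may fill before a background spill begins -->
  <name>io.sort.spill.percent</name>
  <value>0.80</value>
</property>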
157. …debarchan/StockData/tableIBM.csv
hadoop fs -cp TableMSFT.csv debarchan/StockData/tableMSFT.csv
hadoop fs -cp TableOracle.csv debarchan/StockData/tableOracle.csv

Note: The file and folder names are case sensitive. Also, you will need to replace the user name value with the one you configured for Remote Desktop access.

This will copy all the csv files under the debarchan/StockData folder. Once the source files are staged in your WASB, you need to define the Hive schema that will be a placeholder for your Hive tables when you actually load data into them. Note that to run the PowerShell commands, you have to download and install Windows Azure HDInsight PowerShell, as described in Chapter 4. The HDInsight PowerShell modules are integrated with Windows Azure PowerShell version 0.7.2 and can be downloaded from http://www.windowsazure.com/en-us/documentation/articles/hdinsight-install-configure-powershell.

Execute the command in Listing 8-4 to create the Hive table.

Listing 8-4. Creating the Hive table stock_analysis

$subscriptionName = "YourSubscriptionName"
$storageAccountName = "democluster"
$containerName = "democlustercontainer"
$clustername = "democluster"
$querystring = "create external table stock_analysis (stock_symbol string, stock_date string, stock_price_open double, stock_price_high double, stock_price_low double, stock_price_close double, stock_volume int, stock_pric…
158. (Figure 11-9 residue: the emulator directory listing, with folders such as oozie-3.2.0-incubating, pig-0.9.3-SNAPSHOT, sqoop-1.4.2, and templeton-0.1.4, dated 10/22/2013 and 11/13/2013.)

Figure 11-9. The emulator directory structure

Note: The logging infrastructure changes in the emulator are explained in detail in Chapter 7.

Summary

This chapter walked you through the logging mechanism used in the HDInsight service. Although it focused on HDInsight-specific logging operations, it also gives you a glimpse of how the traditional Apache Hadoop Log4j logging infrastructure can be leveraged. You read about several logging optimizations to avoid logging and maintaining irrelevant footprints. You also learned about enabling monitoring and logging on your Azure storage account through the Azure management portal. Once an HDInsight cluster is operational, and when it comes to consuming data, you need to know about logging the client-side driver calls as well. At the end of the day, data is viewed from interactive client applications like graphs and charting applications. Logging the Hive ODBC driver calls is very essential, because it forms the bridge between your client consumer and your Hadoop cluster.

CHAPTER 12
Troubleshooting Cluster Deployments

Once you really start to play around with your HDInsight clusters, you are bou…
159. …nodes, try to open the Hadoop Name Node Status portal and check whether the number of live nodes is reported correctly.

Note: Azure VMs periodically go through a process called re-imaging, where an existing VM is released and a new VM gets provisioned. The node is expected to be down for up to 15 minutes when this happens. This is an unattended, automated process, and the end user has absolutely no control over it.

ODBC failures deserve some additional attention. You typically use a client like Microsoft Excel to create your data models from HDInsight data. Any such front-end tool leverages the Hive ODBC driver to connect to Hive running on HDInsight. A typical failure can look like this:

Errors:
From Excel: Unable to establish connection with hive server
From PowerPivot: Failed to connect to the server. Reason: ERROR [HY000] Invalid Attribute Persist Security Info; ERROR [01004] Out connection string buffer not allocated; ERROR [08001] Unable to establish connection with hive server

To start with, always make sure that the basic DSN configuration parameters, such as port number, authentication, and so on, are properly set. For the Azure HDInsight Service, make sure that:

- You are connecting to port 443.
- Hive Server Type is set to Hive Server 2.
- Authentication Mechanism is set to Windows Azure HDInsight Service.
- The correct cluster user name and password are provided.

For the Azure HDInsight Emulator, confirm that:

- You are connecting to port 10…
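The same settings can be supplied in a DSN-less connection string, which is sometimes easier to compare against a known-good configuration. Here is a sketch for the Azure service; the attribute keys follow the Microsoft Hive ODBC driver's conventions, the cluster name and credentials are placeholders, and the AuthMech value should be whatever your driver version documents for Windows Azure HDInsight Service:

Driver={Microsoft Hive ODBC Driver};Host=democluster.azurehdinsight.net;Port=443;HiveServerType=2;AuthMech=6;UID=admin;PWD=<password>;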
160. …Edit to configure the source to connect to the target Hive table, using the connection you just created. Select the connection manager (I named it Hive Connection) and the Hive table; in this case, it's the stock_analysis table you created in Chapter 8, as shown in Figure 10-12. (Figure 10-12 residue: the ADO NET Source Editor, which configures the properties used by a data flow to obtain data from any ADO.NET provider; its Connection Manager page specifies the ADO.NET connection manager, the data access mode (Table or view), and the name of the table or view, here "default"."stock_analysis".)

Figure 10-12. Selecting the Hive table

Tip: You can also create the connection manager on the fly while configuring the source component, by clicking the New button adjacent to the ADO.NET connection manager.

Click on the Preview button to preview the data and ensure that it is being fetched from the source without issue. You should be able to see the first few rows from your Hive table, as in Figure 10-13. (Figure 10-13 residue: the editor's Connection Manager, Columns, and Error Output pages.)
161. doop tasktracker Started Automati Apache Hadoop templeton Started Automati Figure 7 8 The Apache Hadoop services 121 CHAPTER 7 USING WINDOWS AZURE HDINSIGHT EMULATOR There are however changes in the port numbers of the REST APIs that the emulator exposes Logically enough the security constraints are much less restrictive in the local emulator than with the Azure service since the emulator resides in your local machine where you have more control You have to be careful opening the respective ports if you wish to use the REST APIs to obtain status version details and so forth Here is the list of the REST endpoints for the emulator along with their port numbers e Oozie http localhost 11000 oozie v1 admin status e Templeton http localhost 50111 templeton v1 status e ODBC Use port 10001 in the DSN configuration or connection string e Cluster Name Use http localhost 50111 as the cluster name wherever you require To start and stop the Hadoop services on the local emulator you can use the start onebox cmd and stop onebox cmd command files from the C Hadoop directory PowerShell versions for these files are available as well if you are a PowerShell fan as shown in Figure 7 9 ocal Disk C Hadoop gt d Open Print Burn Name E wyvvyvp ULL ULVI IJ start onebox cmd 9 start onebox psl E stop onebox cmd t E stop onebox psl Figure 7 9 Hadoop Service control files
162. doopClient HadoopClient bin Debug Microsoft Hadoop Client d11 EE EE EH Job job_201309161139_003 completed Note commented out the cluster management method calls in the Main function because we are focusing on only the MapReduce job part Also you may see a message about deleting the output folder if it already exists If for some reason the required environment variables are not set you may get an error like the following one while executing the project which indicates the environment is not suitable Environment Vairable not set HADOOP_HOME Environment Vairable not set Java HOME If you encounter such a situation add the following two lines of code to set the variables at the top of your DoCustomMapReduce method This is constant Environment SetEnvironmentVariable HADOOP_HOME c hadoop Needs to be Java path of the development machine Environment SetEnvironmentVariable Java_ HOME c hadoop jvm On successful completion the job returns the job id Using that you can track the details of the job in the Hadoop MapReduce Status or JobTracker portal by remotely connecting to the NameNode Figure 5 4 shows the preceding job s execution history in the JobTracker web application Running Jobs none Completed Jobs Jobid Started Priority User Name nie SE GE 4 pyaar ty l ob 201309161139 0001 121048 An NORMAL admin TempletonControllerJob 100 00 1 1 100 00 2013 ob 201309161139 000
163. ds As with all other Azure storage methods access is provided through REST APIs which you can access at the following site http debarchans table core winodws net MyTableStore e Queue storage Queues are used to transport messages between applications Azure queues are conceptually the same as Microsoft Messaging Queue MSMQ except that they are for the cloud Again REST API access is available For example this could be an URL like http debarchans queue core windows net MyQueueStore Note HDinsight supports only Azure blob storage Azure storage accounts The HDInsight provision process requires a Windows Azure Storage account to be used as the default file system The storage locations are referred to as Windows Azure Storage Blob WASB and the acronym WASB is used to access them WASB is actually a thin wrapper on the underlying Windows Azure Blob Storage WABS infrastructure which exposes blob storage as HDFS in HDInsight and is a notable change in Microsoft s implementation of Hadoop on Windows Azure Learn more about WASB in the upcoming section Understanding the Windows Azure Storage Blob For instructions on creating a storage account see the following URL http www windowsazure com en us manage services storage how to create a storage account 16 CHAPTER 2 UNDERSTANDING WINDOWS AZURE HDINSIGHT SERVICE The HDInsight service provides access to the distributed file system that is locally attached to th
164. …fetchdt: fetch a delegation token from the NameNode
jobtracker: run the MapReduce job Tracker node
pipes: run a Pipes job
tasktracker: run a MapReduce task Tracker node
historyserver: run job history servers as a standalone daemon
job: manipulate MapReduce jobs
queue: get information regarding JobQueues
version: print the version
jar <jar>: run a jar file
distcp <srcurl> <desturl>: copy file or directories recursively
distcp2 <srcurl> <desturl>: DistCp version 2
archive -archiveName NAME <src>* <dest>: create a hadoop archive
daemonlog: get/set the log level for each daemon
or
CLASSNAME: run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

c:\apps\dist\hadoop-1.2.0.1.3.1.0-06>_

Figure 6-6. The Hadoop command line

This will look very familiar to traditional Hadoop users, because this is exactly what you find in the Apache open source project. Again, the point to be noted here is that HDInsight is built on top of core Hadoop, so it supports all the interfaces available with core Hadoop, including the command prompt. For example, you can run the standard ls command to list the directory and file structure of the current directory. The command in Listing 6-1 lists the files and folders you have in the root of your container.

Listing 6-1. The HDFS directory structure

hadoop dfs -ls /

This command lists the files and folders in the r…
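A few more everyday fs subcommands work the same way against the WASB-backed file system. For example (the paths here are illustrative, not from the book):

hadoop fs -mkdir /demo
hadoop fs -copyFromLocal C:\data\sample.txt /demo/sample.txt
hadoop fs -cat /demo/sample.txt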
165. …MapReduce metrics for your cluster. The .NET SDK also supports other functionalities, like data serialization, using the open source Apache project Avro. For a complete list of the SDK functionalities, refer to the following site: http://hadoopsdk.codeplex.com.

Through the HadoopClient program, we automated MapReduce and Hive job submissions. Bundled together with the cluster management operations in the previous chapter, the complete Program.cs file, along with the using statements, should now look similar to Listing 5-13.

Listing 5-13. The complete code listing

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Security.Cryptography.X509Certificates;
using Microsoft.WindowsAzure.Management.HDInsight;
using Microsoft.Hadoop.MapReduce;
using Microsoft.Hadoop.Client;
// For Stream
using System.IO;
// For Ambari Monitoring Client
using Microsoft.Hadoop.WebClient.AmbariClient;
using Microsoft.Hadoop.WebClient.AmbariClient.Contracts;
// For Regex
using System.Text.RegularExpressions;
// For thread
using System.Threading;
// For Blob Storage
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

namespace HadoopClient
{
    class Program
    {
        static void Main(string[] args)
        {
            ListClusters();
            CreateCluster();
            DeleteCluster();
            DoCustomMapReduce();
            …
166. • Select the stock_price_adj_close column, and set its type to Decimal as well.
Next, import the DimDate table from the AdventureWorksDWH database in SQL Server to be able to create a date hierarchy. Click on Get External Data > From Other Sources > Microsoft SQL Server, and provide the SQL Server connection details, as shown in Figure 9-13. [Figure 9-13. Getting data from the AdventureWorksDWH database in SQL Server. The wizard asks for the information required to connect to the Microsoft SQL Server database: a friendly connection name (SqlServer localhost AdventureWorksDWH), the server name, and the database name (AdventureWorksDWH).]
Click on Next, choose to import from tables directly, and click on Next again. Select the DimDate table from the available list of tables to import into the model, as shown in Figure 9-14. [Figure 9-14. The Table Import Wizard's Select Tables and Views page (server: localhost; database: AdventureWorksDWH), where you pick the source tables and views to import.]
167. e Windows Azure Blob Storage is a highly available, scalable, high-capacity, low-cost, and shareable storage option for data that is to be processed using HDInsight. Storing data in WASB enables your HDInsight clusters to be independent of the underlying storage used for computation, and you can safely release those clusters without losing data. The first step toward deploying an HDInsight solution on Azure is to decide on a way to upload data to WASB efficiently. We are talking Big Data here: typically, the data that needs to be uploaded for processing will be in the terabytes and petabytes. This section highlights some off-the-shelf tools from third parties that can help in uploading such large volumes to WASB storage. Some of the tools are free, and some you need to purchase.
Azure Storage Explorer: A free tool that is available from codeplex.com. It provides a nice graphical user interface from which to manage your Azure blob containers. It supports all three types of Azure storage: blobs, tables, and queues. This tool can be downloaded from http://azurestorageexplorer.codeplex.com
Cloud Storage Studio 2: This is a paid tool giving you complete control of your Windows Azure blobs, tables, and queues. You can get a 30-day trial version of the tool from here: http://www.cerebrata.com/products/cloud-storage-studio/introduction
CloudXplorer: This is also a paid tool, available for A
168. e core-site.xml file in the C:\Hadoop\hadoop-1.1.0-SNAPSHOT\conf directory. You can add your Azure storage account key and container in the configuration file to point to Windows Azure Storage Blob (WASB). Listing 7-1 shows a sample entry in the core-site.xml file.

Listing 7-1. core-site.xml

<property>
  <name>fs.azure.account.key.democluster.blob.core.windows.net</name>
  <value>your_storage_account_key</value>
</property>
<property>
  <name>fs.default.name</name>
  <!-- cluster variant -->
  <value>wasb://democlustercontainer@democluster.blob.core.windows.net</value>
  <!-- <value>hdfs://localhost:8020</value> -->
  <description>The name of the default file system. Either the
  literal string "local" or a host:port for NDFS.</description>
  <final>true</final>
</property>
<property>
  <name>dfs.namenode.rpc-address</name>
  <value>hdfs://localhost:8020</value>
  <description>A base for other temporary directories.</description>
</property>

Note: I have a storage account, democluster, and a default container, democlustercontainer. You may need to replace these values with your own. There is also a way to emulate Azure blob storage on your local machine where you have installed the HDInsight emulator. You can
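A quick way to confirm that the emulator is actually resolving the default file system against your Azure container is to compare a relative listing with an explicit one. The two commands below are a sketch that reuses the democluster names from the Note above; substitute your own account and container:

hadoop fs -ls /
hadoop fs -ls wasb://democlustercontainer@democluster.blob.core.windows.net/

The first listing resolves against whatever fs.default.name points to, so if both commands return the same entries, the WASB configuration is in effect. While WASB is the default, the local HDFS name node can still be addressed directly with the fully qualified form hdfs://localhost:8020/.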
169. e Data Flow canvas. Make sure you connect the ADO.NET source and the OLE DB Destination components by dragging the arrow between the source and the destination. This is required for SSIS to generate the metadata and the column mappings for the destination automatically, based on the source schema structure. The package should look something like Figure 10-15. [Figure 10-15. Creating the OLE DB destination. The SSIS toolbox lists the available destinations: ADO NET Destination, Data Mining Model Training, DataReader Destination, Dimension Processing, Excel Destination, Flat File Destination, ODBC Destination, Partition Processing, Raw File Destination, Recordset Destination, SQL Server Compact Destination, OLE DB Destination, and SQL Server Destination.]
Note: In this example, I used the OLE DB Destination component to bind to the target SQL Server table. However, you can also use the ADO NET Destination or SQL Server Destination components for the same purpose. Be aware, though, that SQL Server Destination works only if the package runs locally, on the same system where SQL Server resides.
Now it is time to configure the OLE DB Destination component to point to the correct SQL connection and database table. To do this, right-click the OLE DB Destination component and select Edit. Select the OLE DB connection manager to SQL that you just created, and th
170. e Templeton service. Access to the services in Table 6-2 gives you control of the different programs you need to run on your Hadoop cluster. If it is a really busy cluster doing only core MapReduce processing, you might want to stop the services for a few supporting projects, like Hive and Oozie, which are not used at that point. Your Azure Management portal gives you an option to turn all Hadoop services on or off as a whole, as shown in Figure 6-18. However, through the name node's Services console, you can selectively turn any of the services on or off. [Figure 6-18. Toggle Hadoop services. The democluster CONFIGURATION page in the portal, with the HADOOP SERVICES switch set to OFF.]
Installation Directory
The HDInsight distribution deploys core Hadoop and the supporting projects to the C:\apps\dist directory of the name node. The folder and directory structure of the components are almost the same as in the open source projects, to maintain consistency and compatibility. The directory structure for your name node should look like Figure 6-19. [Figure 6-19. The C:\apps\dist folder, containing subfolders such as bin, examples, hadoop-1.2.0.1.3.1.0-06, and hcatalog-0.11.0.1.3.1.0-...]
171. e_adj_close double)
partitioned by (exchange string)
row format delimited
fields terminated by ','
location 'wasb://democlustercontainer@democluster.blob.core.windows.net/debarchan/StockData';

Note: You may need to wrap each of the commands in a single line, depending on the PowerShell editor you use. Otherwise, you may encounter syntactical errors while running the script.

$HiveJobDefinition = New-AzureHDInsightHiveJobDefinition -Query $querystring
$HiveJob = Start-AzureHDInsightJob -Subscription $subscriptionname -Cluster $clustername -JobDefinition $HiveJobDefinition
$HiveJob | Wait-AzureHDInsightJob -Subscription $subscriptionname -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clustername -Subscription $subscriptionname -JobId $HiveJob.JobId -StandardError

Once the job execution is complete, you should see output similar to the following:

StatusDirectory : 2b391c76-2d33-42c4-a116-d967eb11c115
ExitCode        : 0
Name            : Hive: create external table
Query           : create external table stock_analysis (stock_symbol string, stock_date string, stock_price_open double, stock_price_high double, stock_price_low double, stock_price_close double, stock_volume int, stock_price_adj_close double) partitioned by (exchange string) row format delimited fields terminated by ',' location 'wasb://democlustercontainer@democluster.blob.core.windows.net/debarchan/StockData'
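With the external table in place, the same job-submission pattern works for ad hoc queries. The following is an illustrative sketch — the query itself is my own example, not the book's — reusing the variables defined above:

$querystring = "select stock_symbol, max(stock_price_high) from stock_analysis group by stock_symbol;"
$HiveJobDefinition = New-AzureHDInsightHiveJobDefinition -Query $querystring
$HiveJob = Start-AzureHDInsightJob -Subscription $subscriptionname -Cluster $clustername -JobDefinition $HiveJobDefinition
$HiveJob | Wait-AzureHDInsightJob -Subscription $subscriptionname -WaitTimeoutInSeconds 3600
# Read the query results from standard output rather than standard error
Get-AzureHDInsightJobOutput -Cluster $clustername -Subscription $subscriptionname -JobId $HiveJob.JobId -StandardOutput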
172. e aggregation function. Once the measures are created, click on PivotTable > PivotChart, as shown in Figure 9-18. That will open a new worksheet with a chart. [Figure 9-18. Creating a PivotChart. The PivotTable drop-down offers Chart and Table (Horizontal), Chart and Table (Vertical), Two Charts (Horizontal), Two Charts (Vertical), Four Charts, and Flattened PivotTable.]
Once the new worksheet with the data models is open, drag and drop stock_symbol to Legend (Series). Then drag HDate to Axis (Category) and Average of stock_price_close to Values, as shown in Figure 9-19. [Figure 9-19. The PivotChart Fields pane, listing the table fields (stock_date, stock_price_open, stock_price_high, stock_price_low, stock_price_close, stock_volume, stock_price_adj_close, exchange) and the measures (Average of stock_price_open, Average of stock_price_high, Average of stock_price_low, Average of stock_price_close, Sum of stock_volume), with the Filters, Legend (Series), Axis (Category), and Values drop areas below.]
173. e compute nodes. This file system can be accessed using the fully qualified URI. For example:

hdfs://<namenode>/<path>

The syntax to access WASB is:

wasb://<container>@<accountname>.blob.core.windows.net/<path>

Hadoop supports the notion of a default file system. The default file system implies a default scheme and authority; it can also be used to resolve relative paths. During the HDInsight provision process, you must specify a blob storage container to be used as the default file system. To maintain compatibility with core Hadoop's concept of a default file system, this action adds an entry for the blob store container to the configuration file C:\apps\dist\hadoop-1.1.0-SNAPSHOT\conf\core-site.xml.
Caution: Once a storage account is chosen, it cannot be changed. If the storage account is removed, the cluster will no longer be available for use.
Accessing containers
In addition to accessing the blob storage container designated as the default file system, you can also access containers that reside in the same Windows Azure storage account or in different Windows Azure storage accounts, by modifying C:\apps\dist\hadoop-1.1.0-SNAPSHOT\conf\core-site.xml and adding additional entries for the storage accounts. For example, you can add entries for the following:
• Container in the same storage account: Because the account name and key are stored in core-site.xml during provisioning, you have full access to the
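To make the two URI forms concrete, the following commands address the same directory in the default container three ways: relative to the default file system, scheme-qualified, and fully qualified. This is a sketch using the book's democluster names; treat them as placeholders for your own container and account:

hadoop fs -ls /example/data
hadoop fs -ls wasb:///example/data
hadoop fs -ls wasb://democlustercontainer@democluster.blob.core.windows.net/example/data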
174. e output similar to the following once the file is downloaded:

info:    Executing command account download
info:    Launching browser to http://go.microsoft.com/fwlink/?LinkId=254432
help:    Save the downloaded file, then execute the command
help:    account import <file>
info:    account download command OK

The next step is to import the file in the CLI, using the following command:

azure account import <publishsettings file>

The file should be successfully imported, and the output will be similar to the following:

info:    Executing command account import
info:    Found subscription: <subscription_name>
info:    Setting default subscription to: <subscription_name>
info:    Use "azure account set" to change to a different one
info:    Setting service endpoint to: https://management.core.windows.net
warn:    The file <file> contains sensitive information
warn:    Remember to delete it now that it has been imported
info:    Account publish settings imported successfully
info:    account import command OK

To list the existing HDInsight clusters in your subscription, you can use the following command:

azure hdinsight cluster list

The output will be the list of your already provisioned clusters in the running state. It will be something similar to the following, which I generated with four HDInsight clusters under my subscription:

info:    Executing command hdinsight cluster list
Getting H
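The same CLI can provision and tear down clusters as well. The exact option names vary across versions of the cross-platform tool, so treat the following as an assumed sketch — verify the flags with azure hdinsight cluster create --help before relying on them, and substitute your own names, key, and password:

azure hdinsight cluster create --clusterName democluster --storageAccountName democluster.blob.core.windows.net --storageAccountKey <key> --storageContainer democlustercontainer --nodes 4 --location "North Europe" --username admin --clusterPassword <password>
azure hdinsight cluster delete democluster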
175. e system. Either the
  literal string "local" or a host:port for NDFS.</description>
  <final>true</final>
</property>

If there is an issue with accessing your storage that is causing your jobs to fail, the core-site.xml file is the first place where you should confirm that your cluster is pointing toward the correct storage account and container. The core-site.xml file also has an attribute for the storage key, as shown in Listing 13-2. If you are encountering 502/403 (Forbidden/Authentication) errors while accessing your storage, you must make sure that the proper storage account key is provided.

Listing 13-2. Storage account key

<property>
  <name>fs.azure.account.key.democluster.blob.core.windows.net</name>
  <value>YourStorageAccountKey</value>
</property>

There are also several Azure throttling factors and blob I/O buffer parameters that can be set through the core-site.xml file. They are outlined in Listing 13-3.

Listing 13-3. Azure throttling factors

<property>
  <name>fs.azure.selfthrottling.write.factor</name>
  <value>1.000000</value>
</property>
<property>
  <name>fs.azure.selfthrottling.read.factor</name>
  <value>1.000000</value>
</property>
<property>
  <name>fs.azure.buffer.dir</name>
  <value>/tmp</value>
</property>
<property>
  <name>io.file.buffer.size</name>
  <value>131072</value>
</property>

Note: Azure throttling is discussed in the section "Windows Azure Storage," later in this chapter.

mapred-site.xml
The mapred-site.xml file has the configuration settings for the MapReduce services. It contains parameters for the JobTracker and TaskTracker processes. These parameters determine where the MapReduce jobs place their intermediate and control files, the virtual memory usage by the Map and Reduce jobs, the maximum numbers of mappers and reducers, and many such settings. In the case of a poorly performing job, optimizations such as moving the intermediate files to a fast Redundant Array of Inexpensive Disks (RAID) can be really helpful. Also, in certain scenarios, when you know your job well, you may want to control the number of mappers or reducers being spawned for your job, or increase the default timeout that is set for Map jobs. Listing 13-4 shows a few of the important attributes in mapred-site.xml.

Listing 13-4. mapred-site.xml

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
<property>
  <name>mapred.map.max.attempts
176. e target table. In this case, I named the connection SQL Connection and predefined a table created in the SQL database called stock_analysis. If you don't have the table precreated, you can choose to create the destination table on the fly by clicking the New button adjacent to the name of the table or the view drop-down list. This is illustrated in Figure 10-16. [Figure 10-16. Choosing the target SQL Server table. The OLE DB Destination Editor configures the properties used to insert data into a relational database using an OLE DB provider: specify the OLE DB connection manager, the data access mode (Table or view), and the name of the table or the view; for fast-load data access, you can also set the table update options.]
Note: You can also create the connection manager and the database table on the fly while configuring the destination component, by clicking on the respective New buttons shown in Figure 10-16.
Mapping the Columns
After you set up the connection manager and select the destination table, navigate to the Mappings tab to ensure the co
177. eating an efficient cluster configuration and the ongoing administration required. With storage being a commodity, people are looking for easy, off-the-shelf offerings for Hadoop solutions. This has led to companies like Cloudera, Greenplum, and others offering their own distributions of Hadoop as an out-of-the-box package. The objective is to make Hadoop solutions easily configurable, as well as to make them available on diverse platforms. This has been a grand success in this era of predictive analysis through Twitter, pervasive use of social media, and the popularity of the self-service BI concept. The future of IT is integration: it could be integration between closed and open source projects, integration between unstructured and structured data, or some other form of integration. With the luxury of being able to store any type of data inexpensively, the world is looking forward to entirely new dimensions of data processing and analytics.
Note: HDInsight currently supports Hive, Pig, Oozie, Sqoop, and HCatalog out of the box. The plan is to also ship HBase and Flume in future versions. The beauty of HDInsight, or any other distribution, is that it is implemented on top of the Hadoop core, so you can install and configure any of these supporting projects on the default install. There is also every possibility that HDInsight will support more of these projects going forward, depending on user demand. Micr
178. eb/handlers/webpi.ashx?...getinstaller...HDINSIGHT-PREVIEW...appids
You can also go to the Emulator installation page and launch the installer: http://www.microsoft.com/web/gallery/install.aspx?appid=HDINSIGHT
When prompted, execute the installer, and you should see the Web Platform Installer ready to install the emulator, as shown in Figure 7-1. [Figure 7-1. Web PI. The installer page notes that this product refers to a preview version of the Microsoft HDInsight Emulator for Windows Azure (formerly known as HDInsight Developer Preview); the product has now been officially released and is no longer in preview, and you will be automatically redirected to the latest version during installation, although links that refer directly to the preview version may be disabled in the future. The current release remains available by searching for "Microsoft HDInsight Emulator for Windows Azure" in the Web Platform Installer. Publisher: Microsoft; Version: PREVIEW; Release date: Monday, October 28, 2013.]
Click on Install and accept the license terms to start the emulator installation. As stated earlier, it will download and install the Hortonworks Data Platform on your server, as shown in Figure 7-2. [Figure 7-2. The Web Platform Installer 4.6 progressing through its Prerequisites, Install, Configure, and Finish steps.]
179. ect Avro. In your HadoopClient solution, install the Microsoft.WindowsAzure.Management.HDInsight package by running the following command in the Package Manager Console:

Install-Package Microsoft.WindowsAzure.Management.HDInsight

Figure 4-3 shows how you would type the command into the Visual Studio Package Manager Console. [Figure 4-3. Install the NuGet package. The Package Manager Console (Host Version 2.7.40808.167, package source nuget.org, default project HadoopClient) with the Install-Package command typed at the PM> prompt. Each package is licensed to you by its owner; Microsoft is not responsible for, nor does it grant any licenses to, third-party packages.]
You should see the following output if the package is imported successfully:

Installing 'Microsoft.WindowsAzure.Management.HDInsight 0.9.4951.25594'.
Successfully installed 'Microsoft.WindowsAzure.Management.HDInsight 0.9.4951.25594'.
Adding 'Microsoft.WindowsAzure.Management.HDInsight 0.9.4951.25594' to HadoopClient.
Successfully added 'Microsoft.WindowsAzure.Management.HDInsight 0.9.4951.25594' to HadoopClient.

Note: The version numbers that yo
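The code listings in these chapters reference a Constants helper class for subscription and cluster settings, which is not reproduced at this point in the excerpt. The following is a minimal sketch of what such a class could look like; every value is a placeholder (and the exact field formats may differ from the book's own listing), so replace them with your subscription, certificate, cluster, and storage details:

internal static class Constants
{
    // Azure subscription and management-certificate thumbprint (placeholders)
    public const string subscriptionId = "<your-subscription-id>";
    public const string thumbprint = "<your-certificate-thumbprint>";

    // HDInsight cluster coordinates (placeholders)
    public const string clusterName = "democluster";
    public static readonly Uri azureClusterUri = new Uri("https://democluster.azurehdinsight.net");
    public const string clusterUser = "admin";
    public const string clusterPassword = "<your-cluster-password>";
    public const string hadoopUser = "hadoop";

    // Default storage account and container (placeholders)
    public const string storageAccount = "democluster.blob.core.windows.net";
    public const string storageAccountKey = "<your-storage-account-key>";
    public const string container = "democlustercontainer";
}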
180. ectory rename failed with exception:

System.IO.IOException: The process cannot access the file because it is being used by another process.
   at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
   at System.IO.Directory.Move(String sourceDirName, String destDirName)
   at Microsoft.Hadoop.Deployment.Engine.Commands.AzureBeforeHadoopInstallCommand.Execute(DeploymentContext deploymentContext, INodeStore nodeStore)
Version: 1.0.0.0; ActivityId: 8f270dd7-4691-4a69-945f-e0a1a81605c1; AzureVMName: RD00155D6135E3; IsException: False; ExceptionType: ; ExceptionMessage: ; InnerExceptionType: ; InnerExceptionMessage: ; Exception:

You may also come across scenarios where the cluster creation process completes, but you don't see the packages that should have been deployed to the nodes in place. For example, the cluster deployment is done, but you don't find Hive installed in the C:\Apps\Dist directory. These installer and deployment logs could give you some insight if something went wrong after VM provisioning. In most of these cases, re-creating the cluster is the easiest and recommended solution. For the HDInsight emulator, the same pair of deployment logs is generated, but in a different directory. They can be found in the C:\HadoopInstallFiles directory, as shown in Figure 12-1. [Figure 12-1. The C:\HadoopInstallFiles folder, containing the HadoopPac
181. INFO mapred.LoadSnappy: Snappy native library loaded
INFO mapred.JobClient: Running job: job_201311240635_0196
INFO mapred.JobClient:  map 0% reduce 0%
INFO mapred.JobClient:  map 100% reduce 0%
INFO mapred.JobClient:  map 100% reduce 33%
INFO mapred.JobClient:  map 100% reduce 100%
INFO mapred.JobClient: Job complete: job_201311240635_0196
INFO mapred.JobClient: Counters: 29
INFO mapred.JobClient:   Job Counters
INFO mapred.JobClient:     Launched reduce tasks=1
INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=8968
INFO mapred.JobClient:     Total time spent by all reduces waiting=...
INFO mapred.JobClient:     Total time spent by all maps waiting after=...
INFO mapred.JobClient:     Launched map tasks=1
INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10562
INFO mapred.JobClient:   File Output Format Counters
INFO mapred.JobClient:     Bytes Written=337623
INFO mapred.JobClient:   FileSystemCounters
INFO mapred.JobClient:     WASB_BYTES_READ=1395666
INFO mapred.JobClient:     FILE_BYTES_READ=466915
INFO mapred.JobClient:     HDFS_BYTES_READ=161
INFO mapred.JobClient:     FILE_BYTES_WRITTEN=1057448
INFO mapred.JobClient:     WASB_BYTES_WRITTEN=337623
INFO mapred.JobClient:   File Input Format Counters
INFO mapred.JobClient:     Bytes Read=1395667
INFO mapred.JobClient:   Map-Reduce Framework
INFO mapred.JobClient:     Map output materialized bytes=466761
INFO mapred.JobClient:     Map input records=32118
INFO mapred.JobClient:     Reduce shuffle bytes=466761
INFO mapred.JobClient:     Spilled Records=65912
INFO mapred.JobClient:     Map output bytes=2387798
INFO mapred.JobClient:     Total committed heap usage=...
INFO mapred.JobClient:     CPU time spent (ms)=74
182. 13/12/09 19:47:42 INFO mapred.JobClient:     Rack-local map tasks=1
13/12/09 19:47:42 INFO mapred.JobClient:     Launched map tasks=1
13/12/09 19:47:42 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10640
13/12/09 19:47:42 INFO mapred.JobClient:   File Output Format Counters
13/12/09 19:47:42 INFO mapred.JobClient:     Bytes Written=337623
13/12/09 19:47:42 INFO mapred.JobClient:   FileSystemCounters
13/12/09 19:47:42 INFO mapred.JobClient:     WASB_BYTES_READ=1395666
13/12/09 19:47:42 INFO mapred.JobClient:     FILE_BYTES_READ=466915
13/12/09 19:47:42 INFO mapred.JobClient:     HDFS_BYTES_READ=161
13/12/09 19:47:42 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=1053887
13/12/09 19:47:42 INFO mapred.JobClient:     WASB_BYTES_WRITTEN=337623
13/12/09 19:47:42 INFO mapred.JobClient:   File Input Format Counters
13/12/09 19:47:42 INFO mapred.JobClient:     Bytes Read=1395667
13/12/09 19:47:42 INFO mapred.JobClient:   Map-Reduce Framework
13/12/09 19:47:42 INFO mapred.JobClient:     Map output materialized bytes=466761
13/12/09 19:47:42 INFO mapred.JobClient:     Map input records=32118
13/12/09 19:47:42 INFO mapred.JobClient:     Reduce shuffle bytes=466761
13/12/09 19:47:42 INFO mapred.JobClient:     Spilled Records=65912
13/12/09 19:47:42 INFO mapred.JobClient:     Map output bytes=2387798
13/12/09 19:47:42 INFO mapred.JobClient:     Total committed heap usage (bytes)=1029046272
13/12/09 19:47:42 INFO mapred.JobClient:     CPU time spent (ms)=7547
13/12/09 19:47:42 INFO mapred.JobClient:     Combine input records=251357
13/12/09 19:47:42 INFO mapred.JobClient:     SPLIT_RAW_BYTES=161
13/12/09 19:47:42 INFO mapred.JobClient:     Reduce input records=32956
13/12/09 19:47:42 INFO mapred.JobClient:     Reduce input groups=32956
13/12/09 19:47:42 INFO mapred.JobClient:     Combine output records=32956
13/12/09 19:47:42 INFO mapred.JobClient:     Physical memory (bytes) snapshot=495923200
13/12/09 19:47:42 INFO mapred.JobClient:     Reduce output records=32956
13/12/09 19:47:42 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1430675456
13/12/09 19:47:42 INFO mapred.JobClient:     Map output records=251357

ICloudBlob    : Microsoft.WindowsAzure.Storage.Blob.CloudBlockBlob
BlobType      : BlockBlob
Length        : 337623
ContentType   : application/octet-stream
LastModified  : 12/9/2013 7:47:39 PM +00:00
SnapshotTime  :
Context       : Microsoft.WindowsAzure.Commands.Storage.Model.ResourceModel.AzureStorageContext
Name          : example/data/WordCountOutputPS/part-r-00000

human       57
humana      1
humane      1
humani      2
humanist    1
humanists   1
humanorum   1
inhuman     1
l'humano    1

Depending on your computer's security policie
183. [Figure 11-4. Log4j properties file. The Hadoop conf folder listing, containing capacity-scheduler.xml, configuration.xsl, core-site.xml, fair-scheduler.xml, hadoop-env.cmd, hadoop-env.sh, hadoop-metrics2.properties, hadoop-metrics2-azure-file-system.properties, hadoop-policy.xml, hdfs-site.xml, log4j.properties, mapred-queue-acls.xml, mapred-site.xml, masters, and slaves.]
There is a section in the file where you can specify the level of detail to be recorded. The following code shows a snippet of the properties file:

# FSNamesystem Audit logging
# All audit events are logged at INFO level
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=WARN

# Custom Logging levels
#hadoop.metrics.log.level=WARN
log4j.logger.org.apache.hadoop.mapred.JobTracker=DEBUG
log4j.logger.org.apache.hadoop.mapred.TaskTr
184. .......................................................... 13
Windows Azure HDInsight Service .......................................................... 14
... .......................................................... 15
Storage Location Options .......................................................... 16
Windows Azure Flat Network Storage .......................................................... 20
Summary .......................................................... 22
Chapter 3: Provisioning Your HDInsight Service Cluster .......................................................... 23
Creating the Storage Account .......................................................... 23
Creating a SQL ... .......................................................... 26
Deploying Your HDInsight Cluster .......................................................... 27
Customizing Your Cluster Creation .......................................................... 28
Configuring the Cluster User and Hive/Oozie Storage .......................................................... 29
Choosing Your Storage Account .......................................................... 30
Finishing the Cluster Creation .......................................................... 32
Monitoring ... .......................................................... 33
Configuring the Cluster .......................................................... 34
Summary .......................................................... 37
Chapter 4: Automating HDInsight Cluster Provisioning .......................................................... 39
Using the Hadoop .NET SDK .......................................................... 39
Adding the NuGet Package ..............
185. efinition = new MapReduceJobCreateParameters()
{
    JarFile = "wasb:///example/jars/hadoop-examples.jar",
    ClassName = "wordcount"
};

mrJobDefinition.Arguments.Add("wasb:///example/data/gutenberg/davinci.txt");
mrJobDefinition.Arguments.Add("wasb:///example/data/WordCountOutput");

// Get certificate
var store = new X509Store();
store.Open(OpenFlags.ReadOnly);
var cert = store.Certificates.Cast<X509Certificate2>().First(item => item.Thumbprint == Constants.thumbprint);
var creds = new JobSubmissionCertificateCredential(Constants.subscriptionId, cert, Constants.clusterName);

// Create a hadoop client to connect to HDInsight
var jobClient = JobSubmissionClientFactory.Connect(creds);

// Run the MapReduce job
JobCreationResults mrJobResults = jobClient.CreateMapReduceJob(mrJobDefinition);
Console.Write("Executing WordCount MapReduce Job.");

// Wait for the job to complete
Wai

// Print the MapReduce job output
Stream stream = new MemoryStream();
CloudStorageAccount storageAccount = CloudStorageAccount.Parse("DefaultEndpointsProtocol=https;AccountName=" + Constants.storageAccount + ";AccountKey=" + Constants.storageAccountKey);
CloudBlobClient blobClient = storageAccount.CreateCloudBlobClient();
CloudBlobContainer blobContainer = blobClient.GetContainerReference(Constants.container);
CloudBlockBlob blockBlob = blobContainer.GetBlockBlobReference("examp
186. el Note: The TaskTracker service runs on the data nodes, so there is no shortcut created for that portal in the name node. You need to log on remotely to any of your data nodes to launch the TaskTracker portal. Remember, the remote logon session needs to be initiated from the name node Remote Desktop session itself; it will not work if you try to connect remotely to your data node from your client workstation.
This Java-based web portal displays the completed tasks along with their status, as shown in Figure 6-15. [Figure 6-15. The TaskTracker web portal, with Running tasks, Task Attempts, and Non-Running Tasks sections; it lists attempts such as attempt_201312100246_0021_m_000001_0 (SUCCEEDED) and attempt_201312100246_0021_m_000000_0 (SUCCEEDED), each at 100.00% progress.]
The Running tasks section of the TaskTracker is populated only if a job (which comprises one or more tasks) is in execution at that point in time. If any MapReduce job is running in the cluster, this section will show the details of each of the Map and Reduce tasks, as shown in Figure 6-16. [Figure 6-16. A running task, attempt_201312100246_0025_m_000000_0, with status RUNNING.]
187. en options to be 4, 8, 16, or 32. Any number of data nodes can be specified when using the CUSTOM CREATE option, discussed in the next section. Pricing details on the billing rates for various cluster sizes are available; click on the symbol just above the drop-down box and follow the link on the pop-up.
Customizing Your Cluster Creation
You can also choose CUSTOM CREATE to customize your cluster creation further. Clicking on CUSTOM CREATE launches a three-step wizard. The first step requires you to provide the cluster name and specify the number of nodes, as shown in Figure 3-9. You can specify your data center region and any number of nodes here, unlike the fixed set of options available with QUICK CREATE. [Figure 3-9. Customizing the cluster creation. The NEW HDINSIGHT CLUSTER Cluster Details page, with fields for CLUSTER NAME (democluster.azurehdinsight.net), SUBSCRIPTION NAME, DATA NODES (the cluster size affects the cluster price; see the pricing details), HDINSIGHT VERSION (default), and REGION (North Europe).]
Configuring the Cluster User and Hive/Oozie Storage
Click on the Next arrow in the bottom right corner of the wizard to bring up the Configure Cluster User screen. Provide the cluster credentials you would like to be set for accessing the HDInsight cluster. Here, you can specify the Hive/Oozie metastore to be the SQL Azure database y
188. er The HDInsight Service brings you the simplicity of deploying and managing your Hadoop clusters in the cloud, and it enables you to do that in a matter of just a few minutes. Enterprises can now free themselves of the considerable cost and effort of configuring, deploying, and managing Hadoop clusters for their data-mining needs. As a part of its Infrastructure as a Service (IaaS) offerings, HDInsight also provides a cost-efficient approach to managing and storing data. The HDInsight Service uses Windows Azure blob storage as the default file system.
Note: An Azure storage account is required to provision a cluster. The storage account you associate with your cluster is where you will store the data that you will analyze in HDInsight.
Creating the Storage Account
You can have multiple storage accounts under your Azure subscription. You can choose any of your existing storage accounts in which to persist your HDInsight cluster's data, but it is always a good practice to have dedicated storage accounts for each of your Azure services. You can even choose to have your storage accounts in different data centers, distributed geographically, to reduce the impact on the rest of the services in the unlikely event that a data center goes down. To create a storage account, log on to your Windows Azure Management Portal (https://manage.windowsazure.com) and navigate to the storage section, as shown in Figure 3-1.
189. er Storage is set up using your storage account and key. A container is then created with a default name matching the storage account name. (Note: you can customize the storage account links if you wish.) When this is successful, all the preconditions for setup have been met. If you encounter a failure at this step, it is highly likely that you have provided incorrect storage account details or a duplicate container name.

Table 12-1. (continued)

Status: Windows Azure VM Configuration
What it means: The HDInsight Deployment Service makes calls to Azure to initiate the provisioning of virtual machines (VMs) for the head node, worker nodes, and gateway node(s). The gateway acts as the security boundary between the cluster and the outside world. All traffic coming into the cluster goes through the gateway for authentication and authorization. The gateway can be thought of as a proxy that performs the necessary operations and forwards the request to the appropriate cluster components. So, if you try to connect through Templeton or Hive from, say, Excel, the call enters the gateway and then is proxied through to the rest of the components.

Status: HDInsight Configuration
What it means: On startup, each node runs custom actions that download and install the appropriate components. These actions are coordinated by the individual node's local Deployment Agent. Installations of the Java Runtime, Hortonwork
190. erty gt lt property gt lt name gt io file buffer size lt name gt lt value gt 131072 lt value gt lt property gt Note Azure throttling is discussed in the section Windows Azure Storage later in this chapter mapred site xml The mapred site xml file has the configuration settings for MapReduce services It contains parameters for the JobTracker and TaskTracker processes These parameters determine where the MapReduce jobs place their intermediate files and control files the virtual memory usage by the Map and Reduce jobs the maximum numbers of mappers and reducers and many such settings In the case of a poorly performing job optimizations such as moving the intermediate files to a fast Redundant Array of Inexpensive Disks RAID can be really helpful Also in certain scenarios when you know your job well you may want to control the number of mappers or reducers being spawned for your job or increase the default timeout that is set for Map jobs Listing 13 4 shows a few of the important attributes in mapred site xml 221 CHAPTER 13 TROUBLESHOOTING JOB FAILURES Listing 13 4 mapred site xml lt property gt lt name gt mapred tasktracker map tasks maximum lt name gt lt value gt 4 lt value gt lt property gt lt property gt lt name gt mapred tasktracker reduce tasks maximum lt name gt lt value gt 2 lt value gt lt property gt lt property gt lt name gt mapred map max attempts
191. es, and those pipelines can push data directly to WASB.
• Azure blob storage is a useful place to store data across diverse services. In a typical case, HDInsight is a piece of a larger solution in Windows Azure; Azure blob storage can be the common link for unstructured blob data in such an environment.
Note: Most HDFS commands, such as ls, copyFromLocal, and mkdir, will still work as expected. Only the commands that are specific to the native HDFS implementation (which is referred to as DFS), such as fsck and dfsadmin, will show different behavior on WASB.
Figure 2-2 shows the architecture of an HDInsight service using WASB. [Figure 2-2. HDInsight with Azure blob storage. The master and worker nodes default to WASB, with DFS as a fallback; WASB in turn uses the underlying containers (Container 1 as the default file system, Container 2, ..., Container n) in Windows Azure blob storage.]
As illustrated in Figure 2-2, the master node, as well as the worker nodes in an HDInsight cluster, default to WASB storage, but they also have the option to fall back to traditional DFS. In the case of the default WASB, the nodes in turn use the underlying containers in the Windows Azure blob storage.
Uploading Data to Windows Azure Storage Blob
Windows Azure HDInsight clusters are typically deployed to execute MapReduce jobs and are dropped once these jobs have completed. Retaining large volumes of data in HDFS after computations are done is not at all cost-effectiv
192. ession Task, File System Task, FTP Task, and Data Flow Task entries in the SSIS toolbox. [Figure 10-4. SSIS data flow task.]
Double-click the data flow task, or click the Data Flow tab in SSDT, to edit the task and design the source and destination components, as shown in Figure 10-5. [Figure 10-5. The Data Flow tab. The designer exposes the Control Flow, Data Flow, Parameters, Event Handlers, and Package Explorer tabs, with an empty Data Flow Task canvas prompting you to add your source and destination components.]
Creating the Source Hive Connection
Now it's time to create the connection to the Hive source. First, create a connection manager that will connect to your Hive data tables hosted in HDInsight. You will use an ADO.NET connection, which will use the data source HadoopOnAzure you created in Chapter 8 to connect to Hive. To create the connection, right-click in the Connection Managers section in the project and select New ADO.NET Connection, as shown in Figure 10-6. [Figure 10-6. Creating a new connection manager. The Connection Managers pane's context menu offers New OLE DB Connection, New Flat File Connection, New ADO.NET Connection, New Analysis Services Connection, New File Connection, New Connection, plus Copy, Paste, Delete, Rename, and Properties.]
193. esult is very high bandwidth network connectivity for storage clients This new network design enables MapReduce scenarios that can require significant bandwidth between compute and storage Microsoft plans to continue to invest in improving bandwidth between compute and storage as well as increase the scalability targets of storage accounts and partitions as time progresses Figure 2 3 shows a conceptual view of Azure ENS interfacing between blob storage and the HDInsight compute nodes Azure BLOB Storage findows Azure Flat Network Storage FNS x Le eg Ben ei e e A d d Fi j S i l l j J fe bh K d S A Ng wr d E e HDinsight Compute Nodes ae Figure 2 3 Azure Flat Network Storage 21 CHAPTER 2 UNDERSTANDING WINDOWS AZURE HDINSIGHT SERVICE Summary In this chapter you read about the Windows Azure HDInsight service You had a look into subscribing to the HDInsight service which defaults to the Windows Windows Azure Storage Blob WASB as the data repository rather than to HDFS as in traditional Hadoop This chapter covered the benefits of using WASB as the storage media in the cloud and it mentioned some available tools for uploading data to WASB Also discussed was the brand new Azure Flat Network Storage FNS designed specifically for improved network bandwidth and throughput 22 CHAPTER 3 Provisioning Your HDInsight Service Clust
194. eta Connections dateand From From Data Fre LI Date Database Service Figure 9 3 PowerPivot for Excel 148 CHAPTER 9 CONSUMING HDINSIGHT FROM SELF SERVICE BI TOOLS Because you are using the Hive ODBC driver choose Others OLEDB ODBC and click Next on the Table Import Wizard as shown in Figure 9 4 Table Import Wizard 5 _2 iea Connect toa Data Source You can either create a connection to a data source or you can use one that already exists Microsoft Access Create a connection to a Microsoft Access database Import tables or views from the database or data returned from a query Oracle Create a connection to an Oracle database Import tables or views from the database or data returned from a query Teradata Create a connection to a Teradata database Import tables or views from the database or data returned from a query Sybase Create a connection to a Sybase database Import tables or views from the database or data returned from a query Informix Create a connection to an Informix database Import tables or views from the database or data returned from a query IBM DB2 Create a connection to a DB2 database Import tables or views from the database or data returned from a query Others OLEDB ODBC Create a connection to a data source by using an OLE DB provider or an OLE DB for ODBC provider Import data from the tables or views that are returned by the Multidimensional Sources
195. f words and their counts, we would display only the words that have the string "human" in them. As you continue to develop your script-based framework for job submissions, it becomes increasingly difficult to manage it without a standard editor. The Windows Azure PowerShell kit provides you with a development environment called Windows PowerShell ISE, which makes it easy to write, execute, and debug PowerShell scripts. Figure 5-6 shows you a glimpse of PowerShell ISE. It has built-in IntelliSense and autocomplete features for your variable and method names that come into play as you type in your code. It also implements a standard coloring mechanism that helps you visually distinguish between the different PowerShell object types. [Figure 5-6. PowerShell ISE, with the script files SubmitJob.ps1, Invoke-Hive.ps1, and Untitled1.ps1 open. The editor shows a script along these lines:]

$subscription = "Your_Subscription_Name"
$cluster = "democluster"
$storageAccountName = "democluster"
$container = "democlustercontainer"
$storageAccountKey = Get-AzureStorageKey $storageAccountName | %{ $_.Primary }
$storageContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAcc
$inputPath = "wasb:///example/data/gutenberg/davinci.txt"
$outputPath = "wasb:///example/data/WordCountOutputPS"
$jarFile = "wasb:///example/jars/hadoop-examples.jar"
$class = "wordcount"
$passwd = ConvertTo-SecureString
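The variables above feed the actual submission that the book builds in the surrounding pages. As a hedged sketch of how such a script typically continues with the HDInsight cmdlets (parameter names are from the 2013-era Azure PowerShell module; verify them against your installed version):

$wordCountJob = New-AzureHDInsightMapReduceJobDefinition -JarFile $jarFile -ClassName $class -Arguments $inputPath, $outputPath
$job = Start-AzureHDInsightJob -Subscription $subscription -Cluster $cluster -JobDefinition $wordCountJob
$job | Wait-AzureHDInsightJob -Subscription $subscription -WaitTimeoutInSeconds 3600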
196. faultStorageContainer = Constants.container,
    UserName = Constants.clusterUser,
    Password = Constants.clusterPassword,
    ClusterSizeInNodes = 2,
    Version = "2.1"
};

Console.Write("Creating cluster...");
var clusterDetails = client.CreateCluster(clusterInfo);
Console.Write("Done.\n");
ListClusters();
}

// Delete an existing HDI cluster
public static void DeleteCluster()
{
    var store = new X509Store();
    store.Open(OpenFlags.ReadOnly);
    var cert = store.Certificates.Cast<X509Certificate2>().First(item => item.Thumbprint == Constants.thumbprint);
    var creds = new HDInsightCertificateCredential(Constants.subscriptionId, cert);
    var client = HDInsightClient.Connect(creds);
    Console.Write("Deleting cluster...");
    client.DeleteCluster("AutomatedHDICluster");
    Console.Write("Done.\n");
    ListClusters();
}

// Run Custom MapReduce
public static void DoCustomMapReduce()
{
    Console.WriteLine("Starting MapReduce job. Log in remotely to your Name Node and check progress from JobTracker portal with the returned JobID.");
    IHadoop hadoop = Hadoop.Connect(Constants.azureClusterUri, Constants.clusterUser, Constants.hadoopUser, Constants.clusterPassword, Constants.storageAccount, Constants.storageAccountKey, Constants.container, true);
    var output = hadoop.MapReduceJob.ExecuteJob<SquareRootJob>();
}

// Run Sample MapReduce Job
public static void DoMapReduce()
{
    // Define the MapReduce job
    MapReduceJobCreateParameters mrJobD
197. figuration Tools, Integration Services, and Performance Tools program groups. [Figure 10-1. SQL Server data tools.]
Create a new project, and choose Integration Services Project in the New Project dialog, as shown in Figure 10-2. [Figure 10-2. New SSIS project. Under Installed Templates > Business Intelligence, the Integration Services Project template is selected ("This project may be used for building high-performance data integration and workflow solutions, including extraction, transformation, and loading (ETL) operations for data warehousing"); the project is named HiveConsumer.]
When you select the Integration Services Project option, an SSIS project with a blank package named Package.dtsx is created. This package is visible in the Solution Explorer window of the project, as shown in Figure 10-3. [Figure 10-3. Solution Explorer, showing the solution HiveConsumer (1 project) with the HiveConsumer p
198. files in the container.
• Container in a different storage account with the public container or the public blob access level: You have read-only permission to the files in the container.
• Container in a different storage account with the private access level: You must add a new entry for each storage account to the C:\apps\dist\hadoop-1.1.0-SNAPSHOT\conf\core-site.xml file to be able to access the files in the container from HDInsight, as shown in Listing 2-1.

Listing 2-1. Accessing a Blob Container from a Different Storage Account

<property>
  <name>fs.azure.account.key.<YourStorageAccountName>.blob.core.windows.net</name>
  <value><YourStorageAccountKeyValue></value>
</property>

Caution: Accessing a container from another storage account might take you outside of your subscription's data center. You might incur additional charges for data flowing across data center boundaries.
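Once the extra key entry is in place, files in that second account can be addressed with a fully qualified WASB URI from the Hadoop command line. This is a quick sketch, with placeholder names for the other account and its container:

hadoop fs -ls wasb://mycontainer@YourStorageAccountName.blob.core.windows.net/
hadoop fs -cat wasb://mycontainer@YourStorageAccountName.blob.core.windows.net/data/sample.txt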
199. from the Windows Azure Management Portal, from your .NET or PowerShell clients, as well as file-system operations from the Hadoop command line. Because these operations might incur additional cost in terms of storage space, logging and monitoring are turned off by default for your storage account. Monitoring and logging can be enabled on all three types of storage: blobs, tables, and queues. You can specify one of three available monitoring levels: Off, Minimal, or Verbose. Similarly, you can set logging on your storage to record any combination of three activity types: Read Requests, Write Requests, and Delete Requests. Navigate to your storage account in the Azure Management portal, click on the Configure link, and choose the desired level of logging, as shown in Figure 11-7. [Figure 11-7. The storage account's Configure page. Monitoring for BLOBS, TABLES, and QUEUES can each be set to OFF, MINIMAL, or VERBOSE, with a retention period in days (specify 0 if you do not want to set a retention policy); logging for BLOBS and TABLES can track Read Requests, Write Requests, and Delete Requests, again with a retention period.]
200. g commands to set them:

$subid = (Get-AzureSubscription -Current).SubscriptionId
$cert = (Get-AzureSubscription -Current).Certificate

Once they are set, you can execute the following command to list your existing HDInsight clusters:

Get-AzureHDInsightCluster -SubscriptionId $subid -Certificate $cert

Because I have two clusters, I get the following output:

PS C:\> Get-AzureHDInsightCluster -SubscriptionId $subid -Certificate $cert

Name               : datadork
ConnectionUrl      : https://datadork.azurehdinsight.net
State              : Running
CreateDate         : 8/16/2013 9:19:09 PM
UserName           : admin
Location           : East US
ClusterSizeInNodes : 4

Name               : democluster
ConnectionUrl      : https://democluster.azurehdinsight.net
State              : Running
CreateDate         : 6/26/2013 6:59:30 PM
UserName           : admin
Location           : East US
ClusterSizeInNodes : 4

To provision a cluster, you need to specify a storage account. The HDInsight cmdlets will need to get the key for your storage account dedicated to the cluster. If you remember, I am using my storage account called hdinsightstorage for all my clusters. Issuing the following PowerShell command will populate the cmdlet variable with the storage account key:

$key1 = Get-AzureStorageKey hdinsightstorage | %{ $_.Primary }

On successful access to the storage account key, you will see messages similar to the following ones:

PS C:\> $key1 = Get-AzureStorageKe
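With the subscription, certificate, and storage key in hand, provisioning itself comes down to one cmdlet. The following is a hedged sketch: the cluster name, container name, and node count are illustrative values of mine, and the parameter names are from the 2013-era HDInsight PowerShell module, so confirm them against your installed version before running it:

$creds = Get-Credential   # cluster admin user name and password
New-AzureHDInsightCluster -SubscriptionId $subid -Certificate $cert -Name "automatedcluster" `
    -Location "East US" -DefaultStorageAccountName "hdinsightstorage.blob.core.windows.net" `
    -DefaultStorageAccountKey $key1 -DefaultStorageContainerName "automatedcluster" `
    -Credential $creds -ClusterSizeInNodes 4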
201. g language called PigLatin.
• Flume: Provides a mechanism to import data into HDFS as data is generated.
• Sqoop: Provides a mechanism to import and export data to and from relational database tables and HDFS.
• Oozie: Allows you to create a workflow for MapReduce jobs.
• HBase: The Hadoop database, a NoSQL database.
• Mahout: A machine-learning library containing algorithms for clustering and classification.
• Ambari: A project for monitoring cluster health statistics and instrumentation.
Figure 1-3 gives you an architectural view of the Apache Hadoop ecosystem. We will explore some of the components in the subsequent chapters of this book, but for a complete reference, visit the Apache web site at http://hadoop.apache.org. [Figure 1-3. The Hadoop ecosystem. Business Intelligence tools (Excel, Power View) sit on a data access layer (ODBC, Sqoop, REST); stats processing (RHadoop), metadata (HCatalog), graph processing (Pegasus), scripting (Pig), query (Hive), machine learning (Mahout), log-file aggregation (Flume), and a NoSQL database (HBase) run over distributed processing (MapReduce) and distributed storage (HDFS).]
As you can see, deploying a Hadoop solution requires setup and management of a complex ecosystem of frameworks (often referred to as a "zoo") across clusters of computers. This might be the only drawback of the Apache Hadoop framework: the complexity and efforts involved in cr
202. ge account is removed, the cluster will no longer be available for use.
[Figure 3-11. Specifying the HDInsight cluster storage account. The Storage Account page, with STORAGE ACCOUNT set to Use Existing Storage, ACCOUNT NAME democluster, DEFAULT CONTAINER democlustercontainer, and ADDITIONAL STORAGE ACCOUNTS 0.]
Note: The name of the default container is the same as that of the HDInsight cluster. In this case, I have pre-created my container in the storage account, which is democlustercontainer.
The CUSTOM CREATE wizard also gives you the option to specify multiple storage accounts for your cluster. The wizard provides additional storage account configuration screens in case you provide a value for the ADDITIONAL STORAGE ACCOUNTS drop-down box, as shown in Figure 3-11. For example, if you wish to associate two more storage accounts with your cluster, you can select the value 2, and there will be two additional screens in the wizard, as shown in Figure 3-12. [Figure 3-12. Adding more storage accounts.]
Finishing the Cluster Creation
Click on Finish; the c
203. ge that connects to Hive using the Microsoft Hive ODBC Driver and imports data from the Hive table stock_analysis to SQL Server. Once the data is in SQL Server, you can leverage warehousing solutions like Analysis Services to slice and dice the data, and use Reporting Services for powerful reporting on it. This also enables you to merge nonrelational data with traditional RDBMS data and extract information from it as a whole.

CHAPTER 11: Logging in HDInsight

A complex ecosystem like Hadoop must have a detailed logging mechanism to fall back on in case something goes wrong. In traditional Hadoop, all the services, like the NameNode, JobTracker, TaskTracker, and so on, have logging capabilities, where each and every operation is logged, right from service startup to shutdown. Apart from service (or daemon) startup, there are additional events that need to be recorded, such as job requests, interprocess communication between the services, job execution history, and others. The HDInsight distribution extends this logging mechanism by implementing its own. As you know, the entire cluster storage for the HDInsight service is in Azure, in the form of blob containers, so you need to know and rely on the Azure storage logs to track down any access or space-limitation issues. This chapter specifically focuses on the logging and instrumentation available for the Windows Azure-based Hadoop services and also gives you a glimpse into the
204. ght Emulator provides you with a local development environment for the Windows Azure HDInsight Service. It uses the same software bits as the Azure HDInsight service and is the test bed recommended by Microsoft for testing and evaluation.
Caution: While it's technically possible to create a multinode configuration of the HDInsight emulator, doing so is neither a recommended nor a supported scenario, because it opens the door to serious security breaches in your environment. If you are still eager to do the multinode configuration, and you delete the firewall rule and modify the conf *.xml Hadoop config files, you'll essentially be allowing anyone to run code on your machine and access your file system. However, such a configuration can be tested in a less sensitive lab environment solely for testing purposes; it is documented in the following blog post: http://binyoga.blogspot.in/2013/07/virtual-lab-multi-node-hadoop-cluster.html
Like the Azure service, the emulator is also based on the Hortonworks Data Platform (HDP), which bundles all the Apache projects under the hood and makes them compatible with Windows. This local development environment for HDInsight simplifies the configuration, execution, and processing of Hadoop jobs by providing a PowerShell library with HDInsight cmdlets for managing the cluster and the jobs run on it. It also provides a .NET SDK for HDInsight for automating these procedures, again much like the Azure service. For users who nee
205. ground on Big Data and the current market trends. This chapter has a brief overview of Apache Hadoop and its ecosystem, and it focuses on how HDInsight evolved as a product.
Chapter 2, "Understanding Windows Azure HDInsight Service," introduces you to Microsoft's Azure-based service for Apache Hadoop. This chapter discusses the Azure HDInsight service and the underlying Azure storage infrastructure it uses. This is a notable difference in Microsoft's implementation of Hadoop on Windows Azure, because it isolates the storage and the cluster as a part of the elastic service offering. Running idle clusters only for storage purposes is no longer the reality, because with the Azure HDInsight service you can spin up your clusters only during job submission and delete them once the jobs are done, with all your data safely retained in Azure storage.
Chapter 3, "Provisioning Your HDInsight Service Cluster," takes you through the process of creating your Hadoop clusters on Windows Azure virtual machines. This chapter covers the Windows Azure Management portal, which offers you step-by-step wizards to provision your HDInsight clusters manually in a matter of a few clicks.
Chapter 4, "Automating HDInsight Cluster Provisioning," introduces the Hadoop .NET SDK and Windows PowerShell cmdlets to automate cluster-creation operations. Automation is a common need for any business process. This chapter enables you to create such configurable and automatic c
206. gs
-Dhadoop.log.file=hadoop-namenode-RD00155D67172B.log
-Dhadoop.home.dir=c:\apps\dist\hadoop-1.2.0.1.3.1.0-06
-Dhadoop.root.logger=INFO,console,DRFA,ETW,FilterLog
-Djava.library.path=;c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\lib\native\Windows_NT-amd64-64;c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\lib\native
-Dhadoop.policy.file=hadoop-policy.xml
-Dcom.sun.management.jmxremote
-Detwlogger.component=namenode
-Dwhitelist.filename=core-whitelist.res
-classpath c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\conf;c:\apps\dist\java\lib\tools.jar;c:\apps\dist\hadoop-1.2.0.1.3.1.0-06;c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\hadoop-ant-1.2.0.1.3.1.0-06.jar;c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\hadoop-client-1.2.0.1.3.1.0-06.jar;c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\hadoop-core-1.2.0.1.3.1.0-06.jar;c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\hadoop-core.jar;c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\hadoop-examples-1.2.0.1.3.1.0-06.jar;c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\hadoop-examples.jar;c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\hadoop-minicluster-1.2.0.1.3.1.0-06.jar;c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\hadoop-test-1.2.0.1.3.1.0-06.jar;c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\hadoop-test.jar;c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\hadoop-tools-1.2.0.1.3.1.0-06.jar;c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\hadoop-tools.jar;c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\lib\*;c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\lib\jsp-2.1\*;c:\apps\dist\log4jetwappender\microsoft-log4j-etwappender-1.0.jar org.apa
207. h streaming of data. Big Data is also an optimum solution for processing historic data and performing trend analyses. Finally, if you decide you need a Big Data solution, the next step is to evaluate and choose a platform. There are several you can choose from, some of which are available as cloud services and some that you run on your own on-premises or hosted hardware. This book focuses on Microsoft's Big Data solution, which is the Windows Azure HDInsight Service. This book also covers the Windows Azure HDInsight Emulator, which provides a test bed for use before you deploy your solution to the Azure service.
The Apache Hadoop Ecosystem
The Apache open source project Hadoop is the traditional and, undoubtedly, most well-accepted Big Data solution in the industry. Originally developed largely by Google and Yahoo!, Hadoop is the most scalable, reliable, distributed computing framework available. It is based on Unix/Linux and leverages commodity hardware. A typical Hadoop cluster might have 20,000 nodes. Maintaining such an infrastructure is difficult, both from a management point of view and a financial one. Initially, only large IT enterprises like Yahoo!, Google, and Microsoft could afford to invest in such Big Data solutions as Google search, Bing maps, and so forth. Currently, however, hardware and storage costs are going down. This enables small companies, or even consumers, to th
h the Advanced Options page are Rows fetched per block, Binary column length, Decimal column scale, usage of Secure Sockets Layer (SSL) certificates, and so on, as shown in Figure 8-9.

[Figure 8-9 shows the Advanced Options dialog box with settings for Rows fetched per block, Fast SQLPrepare, Default string column length, Binary column length, Decimal column scale, Async Exec Poll Interval (ms), SSL options including Allow Common Name Host Name Mismatch, Trusted Certificates (defaulting to C:\Program Files\Microsoft Hive ODBC Driver\lib\cacerts.pem), and Server Side Properties with an "Apply server side properties with queries" option.]

Figure 8-9. DSN Advanced Options dialog box

Once the DSN is successfully created, it should appear in the System DSN list, as shown in Figure 8-10.

[Figure 8-10 shows the ODBC Data Source Administrator's System DSN tab listing the new DSN and the Sample Microsoft Hive DSN, both using the Microsoft Hive ODBC Driver. A System data source is visible to all users on the machine, including NT services.]

Figure 8-10. The HadoopOnAzure System DSN

Note: When you install the ODBC driver, a sample DSN is automatically
[Figure 9-16 shows the DimDate field list (EnglishMonthName, SpanishMonthName, FrenchMonthName, MonthNumberOfYear, CalendarQuarter, CalendarYear, CalendarSemester, FiscalQuarter, FiscalYear, FiscalSemester) and the HDate hierarchy built from FullDateAlternateKey, EnglishMonthName, CalendarYear, and CalendarQuarter.]

Figure 9-16. Creating the hierarchy

Next, go back to the Data View and create measures for your stock table. Select stock_price_open, stock_price_high, stock_price_low, and stock_price_close, and choose Average under AutoSum. Doing that will create measures with average calculations, as shown in Figure 9-17.

[Figure 9-17 shows the PowerPivot Data View with the AutoSum options (Sum, Average, Count, Distinct Count, Max, Min) applied over the stock price columns.]

Figure 9-17. Creating measures for stocks

Go ahead and add another measure for stock_volume, but this time make Sum the
hat generate the service log files. For example, Figure 7-7 shows the Hadoop log files as generated by the Emulator installation.

[Figure 7-7 shows the hadoop-1.1.0-SNAPSHOT\logs folder, containing files such as hadoop-tasktracker-PUMBAA.log, hadoop-datanode-PUMBAA.log, hadoop-namenode-PUMBAA.log, hadoop.log, hadoop-chmod-PUMBAA.log, hadoop-mkdir-PUMBAA.log, hadoop-historyserver-PUMBAA.log, hadoop-jobtracker-PUMBAA.log, hadoop-secondarynamenode-PUMBAA.log, hadoop-format-PUMBAA.log, and a history folder.]

Figure 7-7. Hadoop log files

Note: Details on HDInsight logging are explained in Chapter 11.

By default, the local emulator uses HDFS as its cluster storage. This can be changed by modifying the
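The page is cut off here, but the switch is typically made in the emulator's core-site.xml configuration file. The following is a minimal sketch under that assumption; the account and container names are hypothetical placeholders, and the property names follow the Hadoop 1.x/WASB conventions used elsewhere in this book:

<!-- core-site.xml: point the emulator's default file system at Azure blob storage -->
<property>
  <name>fs.default.name</name>
  <value>wasb://mycontainer@mystorageaccount.blob.core.windows.net</value>
</property>
<!-- Storage account key, so the cluster can authenticate to the blob service -->
<property>
  <name>fs.azure.account.key.mystorageaccount.blob.core.windows.net</name>
  <value>YOUR_STORAGE_ACCOUNT_KEY</value>
</property>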
he application. Because I have a couple of clusters deployed, I see the output shown in Figure 4-6 when I execute the preceding code.

[Figure 4-6 shows the console output of the ListClusters method running under Visual Studio: Cluster: datadork, Nodes: 4 and Cluster: democluster, Nodes: 4.]

Figure 4-6. The ListClusters method

You can use the CreateCluster method of the SDK to programmatically deploy your HDInsight cluster. You will need to provide a few mandatory parameters, such as cluster name, location, storage account, and so on, while calling the CreateCluster method. Listing 4-3 contains the code block to provision a new cluster with two data nodes through .NET code.

Listing 4-3. The CreateCluster Method

public static void CreateCluster()
{
    var store = new X509Store();
    store.Open(OpenFlags.ReadOnly);
    var cert = store.Certificates.Cast<X509Certificate2>()
        .First(item => item.Thumbprint == Constants.thumbprint);
    var creds = new HDInsightCertificateCredential(Constants.subscriptionId, cert);
    var client = HDInsightClient.Connect(creds);

    // Cluster information
    var clusterInfo = new ClusterCreateParameters()
    {
        Name = "AutomatedHDICluster",
        Location = "East US",
        DefaultStorageAccountName = Constants.storageAccount,
        DefaultStorageAccountKey = Constants.storageAccountKey,
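Listing 4-3 (like the other listings in this chapter) references a Constants helper class whose definition is not shown on this page. A minimal sketch of what such a class presumably looks like follows; every value is a placeholder you must replace with your own subscription details:

// Hypothetical helper holding the values referenced by the SDK listings.
internal static class Constants
{
    // Windows Azure subscription ID
    public static Guid subscriptionId = new Guid("00000000-0000-0000-0000-000000000000");

    // Thumbprint of the management certificate uploaded to the subscription
    public const string thumbprint = "YOUR_CERTIFICATE_THUMBPRINT";

    // Default storage account and key used by the cluster
    public const string storageAccount = "mystorageaccount.blob.core.windows.net";
    public const string storageAccountKey = "YOUR_STORAGE_ACCOUNT_KEY";

    // Name of the cluster the job-submission samples target
    public const string clusterName = "democluster";
}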
heck mark button to complete the cluster creation process. It will take up to several minutes to provision the name node and the data nodes, depending on your chosen configuration, and you will see several status messages like the one shown in Figure 3-13 throughout the process.

[Figure 3-13 shows the provisioning stages: Creating Storage, Submitting, Accepted, Cluster Storage Provisioned, Windows Azure VM Configuration, HDInsight Configuration, and Running, with the message "Creation of cluster democluster is complete."]

Figure 3-13. Cluster creation in process

Eventually, the cluster will be provisioned. When it is available, its status is listed as Running, as shown in Figure 3-14.

[Figure 3-14 shows the cluster list with democluster in Running status in the East US location.]

Figure 3-14. An HDInsight cluster that's ready for use

Monitoring the Cluster

You can click on democluster, which you just created, to access your cluster dashboard. The dashboard provides a quick glance at the metadata for the cluster. It also gives you an overview of the entire cluster configuration, its usage, and so on, as shown in Figure 3-15. At this point, your cluster is fresh and clean. We will revisit the dashboard later, after the cluster is somewhat active, and check out the differences.

[Figure 3-15 shows the democluster dashboard with its DASHBOARD, MONITOR, and CONFIGURATION tabs and gauges for active map and reduce tasks.]
hed per block: The recommendation is to keep it at 10,000.

HiveServerType: The HDInsight default is 2.

AuthMech: Authentication mechanism. You'll want to use a value of 6, which maps to using the username and password you specified when the cluster was created, or a value of 3 to connect to the Emulator.

DefaultStringColumnLength: The default length for STRING columns.

A sample connection string using an ODBC DSN named HDISample should look like this:

Provider=MSDASQL.1;Password=<password>;Persist Security Info=True;User ID=admin;Data Source=HDISample;Initial Catalog=HIVE

Note that there are only a few mandatory parameters that need to be passed in the connection string, such as Provider, Data Source, User ID, and Password. The rest of the details, like Port Number and Authentication Mechanism, are embedded in the DSN itself and should be correctly provided while creating the DSN.

Summary

Hive acts as a data warehouse on top of HDFS (WASB in the case of HDInsight), providing an easy and familiar SQL-like query language called HQL to fetch the underlying data. HQL queries are broken down into MapReduce code internally, relieving the end user from writing complex MapReduce code. The Hive ODBC driver acts as an interface between client consumers and HDInsight, enabling access from any tool supporting ODBC. In this chapter, you learned about creating and working with Hive tables, as well as configuring and connecting to Azure HDInsight Service and the Emulator over ODBC.
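As a closing illustration, the DSN created in this chapter can also be consumed from .NET code. The OLE DB connection string above is what tools such as Excel use; from C#, it is usually simpler to go through System.Data.Odbc directly against the DSN. The following is a minimal sketch, assuming the HDISample DSN exists and querying the hivesampletable that ships with the cluster:

using System;
using System.Data.Odbc;

class HiveOdbcSample
{
    static void Main()
    {
        // DSN=HDISample is assumed to be configured as described earlier in this chapter.
        using (var conn = new OdbcConnection("DSN=HDISample;UID=admin;PWD=<password>"))
        {
            conn.Open();
            using (var cmd = new OdbcCommand("select count(*) from hivesampletable", conn))
            {
                // The HQL is compiled to MapReduce on the cluster; this call blocks until it completes.
                Console.WriteLine("Row count: {0}", cmd.ExecuteScalar());
            }
        }
    }
}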
heet in a formatted way. There is no need to access the data through Hive, and there is no dependency on the Microsoft Hive ODBC Driver, as with PowerPivot or Power View. You can repeat the preceding steps to connect to different types of data sources and integrate them in your Excel sheet. The integrated data can then be filtered and shaped to create curated data models targeting specific business requirements.

Summary

In this chapter, you learned how to integrate Microsoft self-service BI tools with HDInsight to consume data and generate powerful visualizations of the data. With the paradigm shifts in technology, the industry is trending toward an era in which Information Technology will be a consumer product. An individual person will be able to visualize the insights he needs, to an extent, from a client-side add-in like Power View. You also had a peep into the Power BI tools that are available from Microsoft to provide data mash-ups and 3-D visualizations of your data. These self-service BI tools provide the capability of connecting and talking to a wide variety of data sources seamlessly and creating in-memory data models that combine the data from these diverse sources for powerful reporting.

CHAPTER 10

Integrating HDInsight with SQL Server Integration Services

Microsoft SQL Server is a complete suite of tools that includes a relational database management system (RDBMS), a multidimensional online analytical processing (OLAP) and tabular
ht

HDInsight is Microsoft's distribution of Hadoop on Windows. Microsoft has embraced Apache Hadoop to provide business insight to all users interested in turning raw data into meaning by analyzing all types of data, structured or unstructured, of any size. The new Hadoop-based distribution for Windows offers IT professionals ease of use by simplifying the acquisition, installation, and configuration experience of Hadoop and its ecosystem of supporting projects in a Windows environment. Thanks to smart packaging of Hadoop and its toolset, customers can install and deploy Hadoop in hours instead of days, using the user-friendly and flexible cluster deployment wizards. This new Hadoop-based distribution from Microsoft enables customers to derive business insights on structured and unstructured data of any size and activate new types of data. Rich insights derived by analyzing Hadoop data can be combined seamlessly with the powerful Microsoft Business Intelligence Platform. The rest of this chapter will focus on the current data-mining trends in the industry, the limitations of modern-day data processing technologies, and the evolution of HDInsight as a product.

What Is Big Data and Why Now?

All of a sudden, everyone has money for Big Data. From small start-ups to mid-sized companies and large enterprises, businesses are now keen to invest in and build Big Data solutions to generate more intelligent data. So what is Big Data all about? In my opinion
ht name node, 89
    Hadoop (see Hadoop)
    installation directory, 110
    remote desktop access
        connect option, 90
        enable, 90-91
        shortcuts, 91
        user account, 90
    windows services, 109-110
HDInsight service cluster
    Azure storage account, 23
    cluster creation customization, 28
    cluster user/Hive/Oozie storage configuration, 29
    configuration
        deleting the cluster, 37
        DISABLE REMOTE button, 36
        ENABLE REMOTE, 35
        Hadoop services, 35
        remote desktop, 36
        screen, 35
    creation, 32-33
    deployment
        cluster details, 28
        CREATE AN HDINSIGHT CLUSTER, 27
        CUSTOM CREATE option, 28
        QUICK CREATE, 28
    monitor
        dashboard, 33
        dashboard refresh rate setting, 34
        MONITOR option, 34
    SQL Azure database creation
        CUSTOM CREATE option, 27
        Hive and Oozie data stores, 26
        MetaStore SQL Azure database, 27
        options, 26
        QUICK CREATE option, 26
    storage account creation
        enable geo-replication, 25
        hdinsightstorage account, 26
        multiple accounts, 23
        new storage account, 25
        QUICK CREATE, 25
        storage account details, 25
        Windows Azure Management Portal, 23-24
    storage account selection, 30
    Windows Azure blob storage, 23
Hive command failure
    compress intermediate file, 232
    hive log, 227
    HQL errors, 228-229
    JobTracker Log, 231-232
    map joins implementation, 233
    MapReduce Operation Log, 230
    query execution phases, 229
    Reducer Task Size configuration, 233
Hive MetaStore, 137
Hive ODBC drivers, 127
    architecture, 129
    DSN-less connection, 144
    Hadoop ecosystem, 128
    installation, 137
ight Emulator ............. 113
Chapter 8: Accessing HDInsight over Hive and ODBC ............. 127
Chapter 9: Consuming HDInsight from Self-Service BI Tools ............. 147
Chapter 10: Integrating HDInsight with SQL Server Integration Services ............. 167
Chapter 11: Logging in HDInsight ............. 187
Chapter 12: Troubleshooting Cluster Deployments ............. 205
Chapter 13: Troubleshooting Job Failures ............. 219
Index ............. 243

Introduction

My journey in Big Data started back in 2012, in one of our unit meetings. Ranjan Bhattacharjee, our boss, threw in some food for thought with his questions: "Do you guys know Big Data? What do you think about it?" That was the first time I heard the phrase Big Data. His inspirational speech on Big Data, Hadoop, and future trends in the industry triggered the passion for learning something new in a few of us. Now we are seeing results from a historic collaboration between open source and proprietary products in the form of Microsoft HDInsight. Microsoft and Apache have joined hands in an effort to make Hadoop available on Windows, and HDInsight is the result. I am a big fan of such i
Figure 9-8. Selecting the table

The Hive table, with all the rows, should get successfully loaded in the PowerPivot model, as shown in Figure 9-9.

[Figure 9-9 shows the Table Import Wizard's Importing page reporting Total: 1, Cancelled: 0, Success: 1, Error: 0, with the stock_analysis import succeeding with 36,153 rows transferred.]

Figure 9-9. Finishing the import

Close the Table Import Wizard. You should see the PowerPivot model populated with data from the stock_analysis table in Hive, as shown in Figure 9-10.

[Figure 9-10 shows the PowerPivot window loaded with rows of IBM stock data: stock_symbol, date, open, high, low, close, volume, and adjusted close.]
ilt-in logging capability, for example, the Microsoft Hive ODBC driver, which is developed through partnership with Simba. In such scenarios, you can use the standard ODBC logging mechanism from the ODBC Data Source Administrator. The only difference here is that the standard mechanism is system-wide ODBC tracing for all ODBC drivers that are installed on your system, as opposed to only the Hive ODBC driver.

Note: Enabling system-wide tracing from ODBC Data Source Administrator can significantly reduce the performance of applications relying on ODBC function calls.

To enable system-wide ODBC tracing, launch the ODBC Data Source Administrator from Control Panel, or click on Start > Run > odbcad32.exe. Navigate to the Tracing tab and click on Start Tracing Now, as shown in Figure 11-6.

[Figure 11-6 shows the Tracing tab of the ODBC Data Source Administrator, with a "Machine-Wide tracing for all user identities" option, a Log File Path of C:\ODBC.log, and a Custom Trace DLL of C:\windows\system32\odbctrac.dll.]

Figure 11-6. Windows ODBC tracing

You need to select the Log File Path to write the logs to. The Custom Trace DLL field should be pre-populated with the Windows-defined tracing DLL and need not be changed. By default, it is set to C:\Windows\System32\odbctrac.dll.
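If you prefer to script this instead of using the GUI, the Tracing tab simply writes a couple of registry values. The following PowerShell sketch assumes the standard per-user location the ODBC Data Source Administrator uses; verify the key on your system before relying on it:

# Sketch: toggle per-user ODBC tracing without opening odbcad32.exe.
$odbcKey = "HKCU:\Software\ODBC\ODBC.INI\ODBC"
Set-ItemProperty -Path $odbcKey -Name Trace     -Value "1"            # "0" turns tracing off again
Set-ItemProperty -Path $odbcKey -Name TraceFile -Value "C:\ODBC.log"  # file that receives the call log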
ime of your data and metadata from the lifetime of the cluster. There is a great sample script to provision an HDInsight cluster using custom configuration available at http://www.windowsazure.com/en-us/documentation/articles/hdinsight-provision-clusters.

CHAPTER 6

Exploring the HDInsight Name Node

The HDInsight name node is just another virtual machine provisioned in Windows Azure. Theoretically, this is the equivalent of the traditional Apache Hadoop name node, or the head node, which is the heart and soul of your Hadoop cluster. I would like to re-iterate what I pointed out in Chapter 1: the name node is the single point of failure in a Hadoop cluster. Most important of all, the name node contains the metadata of the entire cluster's storage blocks and maintains co-ordination among the data nodes, so, understandably, it could bring down the entire cluster.

Note: There is a Secondary Name Node service, ideally run on a dedicated physical server, that keeps track of the changed HDFS blocks in the name node and periodically backs up the name node. In addition, you can fail over to the secondary name node in the unlikely event of a name node failure, but that failover is a manual process.

The HDInsight Service brings a significant change from the traditional approach taken in Apache Hadoop. It does so by isolating the storage to a Windows Azure Storage Blob instead of to the traditional Hadoop Distributed File System (HDFS) that is local to the cluster.
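A quick way to see this separation in action from the name node's Hadoop command line is to address blob storage with wasb:// URIs. The container and account names below are hypothetical placeholders:

REM The familiar fs commands work unchanged, but paths resolve to Azure blob
REM storage rather than to a local HDFS volume.
hadoop fs -ls wasb://mycontainer@mystorageaccount.blob.core.windows.net/example/data

REM Relative paths resolve against the cluster's default container.
hadoop fs -ls /example/data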
in Figure 3-21.

[Figure 3-21 shows the status messages "Remote desktop is enabled for democluster: Request in progress... Completed Successfully."]

Figure 3-21. Remote Desktop is enabled

You can come back to the cluster configuration screen anytime you wish to disable Remote Desktop access. Do that via the DISABLE REMOTE button shown in Figure 3-22.

Figure 3-22. Disable Remote Desktop

Once you are done with your cluster, you can choose to delete the cluster by pressing the DELETE button in the configuration screen. Figure 3-22 shows that button too. Once the cluster deletion process is complete, you will see status messages similar to those in Figure 3-23.

[Figure 3-23 shows the messages "Cluster democluster was deleted: HDInsight Cluster Queued for Deletion... Deleting."]

Figure 3-23. Deleting the cluster

Summary

This chapter gets you started using the Windows Azure HDInsight Service, which makes Apache Hadoop available as a service in the cloud. You saw how to provision your Hadoop clusters in the cloud using the simple wizards available in the Azure Management Portal. You also saw how to create a dedicated storage account and associate it with the cluster, where it is used as the default file system by HDInsight.

CHAPTER 4

Automating HDInsight Cluster Provisioning

It is almost always a requirement for a business to automate activities that are repetitive and can be predicted well
indows Azure.

[Figure 7-2 shows the Web Platform Installer describing Windows Azure ("Windows Azure is an open and flexible cloud platform that enables you to quickly build, deploy and manage applications across a global network of Microsoft-managed datacenters...") along with the download progress for Microsoft HDInsight Emulator for Windows Azure and the install progress for Hortonworks Data Platform for Windows.]

Figure 7-2. Installing HDP

Note: The HDInsight Emulator supports only the 64-bit flavor of the Windows OS family.

Verifying the Installation

Once the installation is complete, you can confirm that it is successful by verifying the presence of the Hadoop portal shortcuts on your desktop. Much like the Azure HDInsight name node, the emulator places the shortcuts to the Name Node status and MapReduce status portals and the Hadoop Command Line on the desktop, as shown in Figure 7-3.

[Figure 7-3 shows the desktop shortcuts for the Hadoop status portals and the Hadoop Command Line.]

Figure 7-3. Hadoop portals

You can also confirm the installation status from the Control Panel > Programs and Features snap-in. You are good if you find HDInsight Emulator and HDP in the list of installed programs, as shown in Figure 7-4.

[Figure 7-4 shows Hortonworks Data Platform 1.1 Developer and Microsoft HDInsight Emulator for Windows Azure listed in Programs and Features.]

Figure 7-4.
$querystring = "load data inpath 'wasb://democlustercontainer@democluster.blob.core.windows.net/debarchan/StockData/tableMSFT.csv' into table stock_analysis partition (exchange='NASDAQ');"

$hiveJobDefinition = New-AzureHDInsightHiveJobDefinition -Query $querystring

$hiveJob = Start-AzureHDInsightJob -Subscription $subscriptionname -Cluster $clustername -JobDefinition $hiveJobDefinition

$hiveJob | Wait-AzureHDInsightJob -Subscription $subscriptionname -WaitTimeoutInSeconds 3600

Get-AzureHDInsightJobOutput -Cluster $clustername -Subscription $subscriptionname -JobId $hiveJob.JobId -StandardError

Note: You may need to wrap each of the commands in a single line to avoid syntax errors, depending on the PowerShell editor you use.

You should see output similar to the following once the job completes:

StatusDirectory : 0b2e0a0b-e89b-4f57-9898-3076c10fddc3
ExitCode        : 0
Name            : Hive: load data inpath wa...
Query           : load data inpath 'wasb://democlustercontainer@democluster.blob.core.windows.net/debarchan/StockData/tableMSFT.csv' into table stock_analysis partition (exchange='NASDAQ');
State           : Completed
SubmissionTime  : 11/24/2013 7:35:18 AM
Cluster         : democluster
PercentComplete :
JobId           : job_201311240635_0006

Logging initialized using configuration in file C:\apps\dist\hive-0.11.0.1.3.1.0-06\conf\hive-log4j.properties
Loading data to table default.stock_analysis partition (excha
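Once the load completes, a quick way to verify the partition is to run an ad hoc query. The following is a sketch using the interactive Hive cmdlets from the same toolkit; check the cmdlet names against your installed version:

# Point the interactive Hive cmdlets at the cluster, then query the loaded partition.
Use-AzureHDInsightCluster -Name $clustername -Subscription $subscriptionname
Invoke-Hive -Query "select count(*) from stock_analysis where exchange='NASDAQ';"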
ing SQL Server to be the source or the destination. For instance, SSIS can be used to move data from an FTP server to a local flat file. SSIS also provides a workflow engine to automate various tasks (data flows, task executions, and so forth) that are executed in an ETL job. An SSIS package execution can itself be one step that is part of an SQL Agent job, and SQL Agent can run multiple jobs independent of each other.

An SSIS solution consists of one or more packages, each containing a control flow to perform a sequence of tasks. Tasks in a control flow can include calls to web services, FTP operations, file system tasks, the automation of command-line commands, and others. In particular, a control flow usually includes one or more data flow tasks, which encapsulate an in-memory, buffer-based pipeline of data from a source to a destination, with transformations applied to the data as it flows through the pipeline. An SSIS package has one control flow, and as many data flows as necessary. Data flow execution is dictated by the content of the control flow.

A detailed discussion of SSIS and its components is outside the scope of this book. In this chapter, I assume you are familiar with basic SSIS package development using Business Intelligence Development Studio (BIDS) in SQL Server 2005/2008/2008 R2, or SQL Server Data Tools in SQL Server 2012. If you are a beginner in SSIS, I recommend that you read one of the many good introductory SSIS books available, as a
ing storage is no longer a practical option.

Volume: Big Data solutions typically store and query thousands of terabytes of data, and the total volume of data is probably growing by ten times every five years. Storage solutions must be able to manage this volume, be easily expandable, and work efficiently across distributed systems.

Velocity: Data is collected from many new types of devices, from a growing number of users, and from an increasing number of devices and applications per user. Data is also emitted at a high rate from certain modern devices and gadgets. The design and implementation of storage and processing must happen quickly and efficiently.

Figure 1-1 gives you a theoretical representation of Big Data, and it lists some possible components or types of data that can be integrated together.

[Figure 1-1 plots data volume (megabytes to petabytes) against data complexity, variety, and velocity, with examples such as click streams, sensors and RFID devices, social sentiment, wikis and blogs, audio/video, log files, advertising, spatial and GPS coordinates, mobile, eCommerce, data market feeds, web logs, digital marketing, weather, search marketing, text/image, and recommendations.]

Figure 1-1. Examples of Big Data and Big Data relationships

There is a striking difference between the speed at which data is generated and the speed at which it is
about using a Big Data solution. Because this book covers Microsoft HDInsight, which is based on core Hadoop, we will first give you a quick look at the Hadoop core components and a few of its supporting projects. The core of Hadoop is its storage system and its distributed computing model. This model includes the following technologies and features:

HDFS: Hadoop Distributed File System is responsible for storing data on the cluster. Data is split into blocks and distributed across multiple nodes in the cluster.

MapReduce: A distributed computing model used to process data in the Hadoop cluster that consists of two phases, Map and Reduce. Between Map and Reduce, shuffle and sort occur. MapReduce guarantees that the input to every reducer is sorted by key. The process by which the system performs the sort, and transfers the map outputs to the reducers as inputs, is known as the shuffle. The shuffle is the heart of MapReduce, and it's where the magic happens. The shuffle is an area of the MapReduce logic where optimizations are made. By default, Hadoop uses Quicksort; afterward, the sorted intermediate outputs get merged together. Quicksort checks the recursion depth and gives up when it is too deep. If this is the case, Heapsort is used. You can customize the sorting method by changing the algorithm used via the map.sort.class value in the hadoop-default.xml file, as sketched below.

The Hadoop cluster, once successfully configured on a system, has the follo
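For illustration, such an override would look like the following sketch. The property name and the QuickSort/HeapSort implementations are from Hadoop 1.x; treat the exact class names as something to verify against your distribution:

<!-- Sketch: swap the intermediate sort implementation used during the shuffle. -->
<property>
  <name>map.sort.class</name>
  <!-- Default is org.apache.hadoop.util.QuickSort -->
  <value>org.apache.hadoop.util.HeapSort</value>
</property>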
[Figure 5-5 shows the MapReduce job counters, including Map input records, Physical memory bytes snapshot (163,479,552), Spilled Records, Total committed heap usage bytes (514,523,136), CPU time spent in ms (5,983), and Virtual memory bytes snapshot (653,365,248).]

Figure 5-5. MapReduce job details

Behind the scenes, an HDInsight cluster exposes a WebHCat endpoint. WebHCat is a Representational State Transfer (REST)-based API that provides metadata management and remote job submission to the Hadoop cluster. WebHCat is also referred to as Templeton. For detailed documentation on Templeton classes and job submissions, refer to the following link:

http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-Win-1.1/ds_Templeton/index.html

Submitting the wordcount MapReduce Job

The .NET SDK for HDInsight also provides simpler ways to execute your existing MapReduce programs or MapReduce code written in Java. In this section, you will submit and execute the sample wordcount MapReduce job and display the output from the blob storage.

First, let's add a helper function that will wait and display a status while the MapReduce job is in progress. This is important because the MapReduce function calls might not be symmetric, and you might see incorrect or intermediate output if you fetch the blob storage when the job execution is in progress. Add the WaitForJobCompletion method to your program.
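The body of WaitForJobCompletion is cut off at this page break. The following is a minimal sketch of such a helper, modeled on the SDK's job-polling pattern; the 10-second interval and the console dots are illustrative choices, not necessarily the author's exact code:

// Polls the job status until it leaves the running states, printing a dot while waiting.
private static void WaitForJobCompletion(JobCreationResults jobResults, IJobSubmissionClient jobClient)
{
    JobDetails jobInProgress = jobClient.GetJob(jobResults.JobId);
    while (jobInProgress.StatusCode != JobStatusCode.Completed &&
           jobInProgress.StatusCode != JobStatusCode.Failed)
    {
        Console.Write(".");
        System.Threading.Thread.Sleep(TimeSpan.FromSeconds(10));
        jobInProgress = jobClient.GetJob(jobInProgress.JobId);
    }
}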
integrate data from heterogeneous data sources through an easy-to-use graphical interface. It is available as an add-in to Excel after you download it from the following web site:

http://www.microsoft.com/en-us/download/details.aspx?id=39379

Power Map is an upcoming offering that previously was known as GeoFlow. Power Map can be used together with Power Query to create stunning three-dimensional visualizations of coordinates plotted over Bing maps. Learn more about Power Map from the following article on Microsoft Developer Network:

http://blogs.msdn.com/b/powerbi/archive/2013/07/07/getting-started-with-pq-and-pm.aspx

In this section, you will use Power Query to connect to your Windows Azure HDInsight Service and load data from HDFS to your Excel worksheet. Begin from the Excel toolbar ribbon by clicking Power Query > From Other Sources > From Windows Azure HDInsight, as shown in Figure 9-23.

[Figure 9-23 shows the Power Query ribbon with the From Other Sources menu open, listing options such as From SharePoint List, From OData Feed, From Windows Azure Marketplace, and From Windows Azure HDInsight.]
into Big Data and the Hadoop world. He is an SME in SQL Server Integration Services and is passionate about the present-day Microsoft self-service BI tools and data analysis, especially social media brand sentiment analysis. Debarchan hails from the city of joy, Calcutta, India, and is presently located in Bangalore, India, for his job in Microsoft's Global Technical Support Center. Apart from his passion for technology, he is interested in visiting new places, listening to music (the greatest creation ever on Earth), meeting new people, and learning new things, because he is a firm believer that "Known is a drop, the unknown is an ocean." On a lighter note, he thinks it's pretty funny when people talk about themselves in the third person.

About the Technical Reviewers

Rodney Landrum went to school to be a poet and a writer. And then he graduated, so that dream was crushed. He followed another path, which was to become a professional in the fun-filled world of Information Technology. He has worked as a systems engineer, UNIX and network admin, data analyst, client services director, and finally as a database administrator. The old hankering to put words on paper, while paper still existed, got the best of him, and in 2000 he began writing technical articles, some creative and humorous, some quite the opposite. In 2010, he wrote The SQL Server Tacklebox, a title his editor disdained, but a book closest to the true creative potential
ironment for use in testing and evaluating your solution before deploying it to the cloud. You save money by not paying for Azure hosting until after your solution is developed, tested, and ready to run. The emulator is available for free and will continue to be a single-node offering.

While keeping all these details about Big Data and Hadoop in mind, it would be incorrect to think that HDInsight is a stand-alone solution or a separate solution of its own. HDInsight is, in fact, a component of the Microsoft Data Platform and part of the company's overall data acquisition, management, and visualization strategy. Figure 1-4 shows the bigger picture, with applications, services, tools, and frameworks that work together and allow you to capture data, store it, and visualize the information it contains. Figure 1-4 also shows where HDInsight fits into the Microsoft Data Platform.

[Figure 1-4 maps the platform's layers: Reporting and Analysis (SQL Server Reporting Services, PowerPivot, Power View, Power Query, Power Map, SharePoint Server/SharePoint Online), Corporate Data Models (SQL Server Analysis Services multidimensional and tabular models), Big Data (Windows Azure HDInsight Emulator and Windows Azure HDInsight Service), Data Stores and SQL Server Integration Services, and Data Sources such as business applications, device sensors, and streaming data feeds.]

Figure 1-4. The Microsoft data platform
is book wouldn't have been a reality. Thanks, Andy, for trusting me and making it possible for me to realize my dream. I truly appreciate the great work you and Linchpin People are doing for the SQL Server and BI community, helping SQL Server to be a better product each day.

Thanks to the folks at Apress: Ana and Jonathan for their patience, Roger for his excellent, accurate, and insightful copy editing, and Rodney and Scott for their supportive comments and suggestions during the author reviews.

I would also like to thank two of my colleagues: Krishnakumar Rukmangathan, for helping me with some of the diagrams for the book, and Amarpreet Singh Bassan, for his help in authoring the chapters on troubleshooting. You guys were of great help. Without your input, it would have been a struggle, and the book would have been incomplete.

Last, but not least, I must acknowledge all the support and encouragement provided by my good friends Sneha Deep Chowdhury and Soumendu Mukherjee. Though you are experts in completely different technical domains, you guys have always been there with me, listening patiently about the progress of the book, the hurdles faced, and what not, from the beginning to the end. Thanks for being there with me through all my blabberings.
    Console.WriteLine("Cluster Created");
    ListClusters();
}

public static void DeleteCluster()
{
    var store = new X509Store();
    store.Open(OpenFlags.ReadOnly);
    var cert = store.Certificates.Cast<X509Certificate2>()
        .First(item => item.Thumbprint == Constants.thumbprint);
    var creds = new HDInsightCertificateCredential(Constants.subscriptionId, cert);
    var client = HDInsightClient.Connect(creds);
    client.DeleteCluster("AutomatedHDICluster");
    Console.WriteLine("Cluster Deleted");
    ListClusters();
}

Windows Azure also exposes a set of PowerShell cmdlets for HDInsight to automate cluster management and job submissions. You can consider cmdlets to be prebuilt PowerShell scripts that do specific tasks for you. The next section describes the PowerShell cmdlets for HDInsight for cluster provisioning.

Using the PowerShell cmdlets for HDInsight

The first step is to install the PowerShell cmdlets for HDInsight from the following URL:

http://www.microsoft.com/en-sg/download/details.aspx?id=40724

When prompted, save and unzip the zip files to a location of your choice. In my case, I chose my Visual Studio solution folder, as shown in Figure 4-9.

[Figure 4-9 shows the HadoopClient solution folder containing the HadoopClient project folder, a packages folder, HadoopClient.sln, the .v11.suo file, and the unzipped Microsoft.WindowsAzure.Management files.]
ith the appropriate connection string, as shown in Figure 9-6. Click on Next.

[Figure 9-6 shows the Table Import Wizard's "Specify a Connection String" page with the friendly name HiveConnection, a Password field, and the connection string Provider=MSDASQL.1;Persist Security Info=True;User ID=admin;Initial Catalog=HIVE;DSN=HadoopOnAzure.]

Figure 9-6. Configuring the connection string

We are going to choose to import from the Hive table directly, but we can also write a query (HiveQL) to fetch the data, as shown in Figure 9-7.

[Figure 9-7 shows the "Choose How to Import the Data" page with two options: select from a list of tables and views to choose the data to import, or write a query that will specify the data to import.]

Figure 9-7. Select the table or write the query

Select the stock_analysis table and click Finish to complete the configuration, as shown in Figure 9-8.

[Figure 9-8 shows the "Select Tables and Views" page for the HIVE catalog, listing hivesampletable and stock_analysis.]
jobs that need to be executed. The errors reported in this stage, or after it, are MapReduce job errors. Further insights can be gained about these failures from the TaskTracker log files on the compute nodes.

Hive SELECT commands with aggregate functions (count, sum, and so on), or with HAVING conditions and column filters, invoke MapReduce jobs to get the command output. For example, if you execute the query select count(*) from hivesampletable, you would see output with MapReduce job details, as shown in Listing 13-11.

Listing 13-11. MapReduce Operation Log

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201311120315_0003, Tracking URL = http://jobtrackerhost:50030/jobdetails.jsp?jobid=job_201311120315_0003
Kill Command = c:\apps\dist\hadoop-1.2.0.1.3.0.1-0302\bin\hadoop.cmd job -kill job_201311120315_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2013-11-16 17:28:38,336 Stage-1 map = 0%, reduce = 0%
2013-11-16 17:28:42,354 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.093 sec
[Figure 12-1 shows the HadoopInstallFiles directory containing the HadoopPackages and HadoopSetupTools folders.]

Figure 12-1. HadoopInstallFiles directory

The HadoopPackages folder contains the zipped Hortonworks Data Platform (HDP), which is basically a bundle of Hadoop and its supporting projects. The HadoopSetupTools folder contains the install/uninstall logs and the command files to initiate the installation or uninstallation. It also contains the command file and PowerShell script for invoking the packaged HDP from the HadoopPackages directory, as shown in Figure 12-2.

[Figure 12-2 shows the HadoopSetupTools folder containing bootstrap_install.cmd, bootstrap_uninstall.cmd, hdp-1.0.1.winpkg.install.log, hdp-1.0.1.winpkg.uninstall.log, winpkg.cmd, winpkg.ps1, and winpkg.utils.psm1.]

Figure 12-2. HadoopSetupTools directory

A typical install log file contains the messages logged sequentially during the installation process as each project is deployed. A snippet of it looks similar to Listing 12-3.

Listing 12-3. HDInsight install log

WINPKG: Logging to existing log C:\HadoopInstallFiles\HadoopSetupTools\hdp-1.0.1.winpkg.install.log
WINPKG: ENV:WINPKG_BIN is C:\HadoopInstallFiles\HadoopSetupTools
WINPKG: Setting Environment
kly explains how to create a SQL database on Azure, which you will later use as storage for Hive and Oozie. Create a new SQL Azure database from your Azure Management Portal: click on New > Data Services > SQL Database. Figure 3-5 shows the use of the QUICK CREATE option to create the database.

[Figure 3-5 shows the portal's New menu with COMPUTE, DATA SERVICES, APP SERVICES, NETWORKS, and STORE categories, the SQL DATABASE service with QUICK CREATE and CUSTOM CREATE options, the database name MetaStore, and a server selection (existing servers or a new SQL database server).]

Figure 3-5. Creating a SQL Azure database

The choices in Figure 3-5 will create a database in your Azure data center with the name MetaStore. It will be 1 GB in size, and it should be listed in your Azure portal, as shown in Figure 3-6.

[Figure 3-6 shows MetaStore (SQL Database, Online) listed alongside democluster (HDInsight Cluster, Running).]

Figure 3-6. The MetaStore SQL Azure database

You can further customize your database creation by specifying the database size, collation, and more, using the CUSTOM CREATE option instead of the QUICK CREATE option. You can see CUSTOM CREATE just under QUICK CREATE in Figure 3-5. You can even import an existing database backup and restore it as a new database, using the IMPORT option in the wizard. However you choose to create it, you now have a database
[Figure 9-14 shows the Table Import Wizard listing the AdventureWorksDW tables (DatabaseLog, DimAccount, DimCurrency, DimCustomer, DimDate, DimDepartmentGroup, and so on) with DimDate selected.]

Figure 9-14. Choosing the DimDate table

If you do not have the AdventureWorksDWH database, you can download it from the following link:

http://msftdbprodsamples.codeplex.com/releases/view/55330

Note: You will see a lot of sample database files available for download when you link to the site just mentioned. For this chapter, download the file that says AdventureWorksDW2012 Data File. After the download is complete, make sure you attach the database to your SQL Server instance. You can do so using the SQL Server Attach Databases wizard, or simply by executing the following SQL statement:

EXEC sp_attach_single_file_db @dbname = 'AdventureWorksDWH',
    @filename = '<path>\AdventureWorksDW2012_Data.mdf'

Once the import of the DimDate table is done, your PowerPivot data model will have two tables loaded in it. The tables are named stock_analysis and DimDate.

Creating a Stock Report

Once the two tables are loaded into the PowerPivot model, click on Diagram View and connect the DimDate table with the stock_analysis table using FullDateAlternateKey and stock_date, as shown in Figure 9-15. Drag stock_date to the FullDateAlternateKey column.
in advance. Through the strategic use of technology and automation, an organization can increase its productivity and efficiency by automating recurring tasks associated with the daily workflow. Apache Hadoop exposes Java interfaces for developers to programmatically manipulate and automate the creation of Hadoop clusters. The Microsoft .NET Framework is part of the automation picture in HDInsight. Existing .NET developers can now leverage their skillset to automate workflows in the Hadoop world. Programmers now have the option to write their MapReduce jobs in C# and VB.NET. Additionally, HDInsight also supports Windows PowerShell to automate cluster operations through scripts. PowerShell is a script-based workflow engine and is a particular favorite of Windows administrators for scripting their tasks. There is also a command-based interface, based on Node.js, to automate cluster management operations.

This chapter will discuss the various ways to use the Hadoop .NET Software Development Kit (SDK), Windows PowerShell, and the cross-platform Command Line Interface (CLI) tools to automate HDInsight service cluster operations.

Using the Hadoop .NET SDK

The Hadoop .NET SDK provides .NET client API libraries that make it easier to work with Hadoop from .NET. Since all of this is open source, the SDK is hosted on the open source site CodePlex and can be downloaded from the following link:

http://hadoopsdk.codeplex.com

CodePlex uses NuGet packages to help you easily
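The SDK ships as NuGet packages. A typical install from the Visual Studio Package Manager Console looks like the following; the package IDs reflect the SDK as of this book's timeframe, so verify the current names on CodePlex/NuGet:

PM> Install-Package Microsoft.WindowsAzure.Management.HDInsight
PM> Install-Package Microsoft.Hadoop.Client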
latform. The following sections describe ways to enable the logging and debugging of PowerShell script executions, which can help you track down a cluster deployment failure.

Using the Write cmdlets

PowerShell has built-in cmdlets for logging that use the verb Write. Each of the cmdlets is controlled by a shell variable whose name ends with "Preference." For example, to turn the warning messages on, set the variable $WarningPreference to Continue. Table 12-2 summarizes the different types of Write cmdlets that PowerShell offers, with the usage description for each.

Table 12-2. PowerShell Write cmdlets

Write-Debug: Writes debug messages to the console from a script or command.

Write-Error: Declares a nonterminating error. By default, errors are sent in the error stream to the host program to be displayed, along with output.

Write-EventLog: Writes an event to an event log. To write an event to an event log, the event log must exist on the computer, and the source must be registered for the event log.

Write-Host: Customizes output. You can specify the color of text by using the ForegroundColor parameter, and you can specify the background color by using the BackgroundColor parameter. The Separator parameter lets you specify a string to use to separate displayed objects. The particular result depends on the program that is hosting Windows PowerShell.
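As a quick illustration of the preference variables and cmdlets above (a sketch; the messages are arbitrary):

# Surface the debug and warning streams while a provisioning script runs.
$DebugPreference   = "Continue"   # default is SilentlyContinue, which hides Write-Debug output
$WarningPreference = "Continue"

Write-Debug   "About to call New-AzureHDInsightCluster"
Write-Warning "Cluster name already exists; generating a new one"
Write-Host    "Provisioning started" -ForegroundColor Green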
le.Write

Note: Do not forget the supporting MapReduce classes: SquareRootMapper, SquareRootReducer, SquareRootJob, and Constants.

Using PowerShell

Apart from the .NET Framework, HDInsight also supports PowerShell cmdlets for job submissions. As of this writing, the Azure HDInsight cmdlets are available as a separate download from the Microsoft download center. In the future, they will be part of Windows Azure PowerShell version 0.7.2, and there will be no separate download. Windows Azure HDInsight PowerShell can be downloaded from:

http://www.windowsazure.com/en-us/documentation/articles/hdinsight-install-configure-powershell

Writing Script

For better code management and readability, let's define a few PowerShell variables to store the values you will refer to throughout the script:

$subscription = "Your Subscription Name"
$cluster = "democluster"
$storageAccountName = "democluster"
$container = "democlustercontainer"
$storageAccountKey = (Get-AzureStorageKey $storageAccountName).Primary
$storageContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
$inputPath = "wasb:///example/data/gutenberg/davinci.txt"
$outputPath = "wasb:///example/data/WordCountOutputPS"
$jarFile = "wasb:///example/jars/hadoop-examples.jar"
$class = "wordcount"
$secpasswd = ConvertTo-SecureString
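The page ends in the middle of the $secpasswd assignment. For completeness, the following sketch shows how variables like these are typically used to finish building the credential and submit the wordcount jar with the same cmdlet set; this continuation is mine, not the author's exact script:

# Assumed continuation: wrap the cluster password in a credential object.
$secpasswd = ConvertTo-SecureString "YourClusterPassword" -AsPlainText -Force
$creds = New-Object System.Management.Automation.PSCredential ("admin", $secpasswd)

# Define the wordcount job from hadoop-examples.jar and submit it.
$wordCountJob = New-AzureHDInsightMapReduceJobDefinition -JarFile $jarFile -ClassName $class -Arguments $inputPath, $outputPath
$job = Start-AzureHDInsightJob -Subscription $subscription -Cluster $cluster -JobDefinition $wordCountJob
$job | Wait-AzureHDInsightJob -Subscription $subscription -WaitTimeoutInSeconds 3600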
le/data/WordCountOutput/part-r-00000");
    blockBlob.DownloadToStream(stream);
    stream.Position = 0;
    StreamReader reader = new StreamReader(stream);
    Console.Write("Done. Word counts are:\n");
    Console.WriteLine(reader.ReadToEnd());
}

// Run Hive Job
public static void DoHiveOperations()
{
    HiveJobCreateParameters hiveJobDefinition = new HiveJobCreateParameters()
    {
        JobName = "Show tables job",
        StatusFolder = "TableListFolder",
        Query = "show tables"
    };

    var store = new X509Store();
    store.Open(OpenFlags.ReadOnly);
    var cert = store.Certificates.Cast<X509Certificate2>()
        .First(item => item.Thumbprint == Constants.thumbprint);
    var creds = new JobSubmissionCertificateCredential(Constants.subscriptionId, cert, Constants.clusterName);
    var jobClient = JobSubmissionClientFactory.Connect(creds);

    JobCreationResults jobResults = jobClient.CreateHiveJob(hiveJobDefinition);
    Console.Write("Executing Hive Job");

    // Wait for the job to complete
    WaitForJobCompletion(jobResults, jobClient);

    // Print the Hive job output
    System.IO.Stream stream = jobClient.GetJobOutput(jobResults.JobId);
    System.IO.StreamReader reader = new System.IO.StreamReader(stream);
    Console.Write("Done. List of Tables are:\n");
    Console.WriteLine(reader.ReadToEnd());
}

// Monitor cluster Map Reduce statistics
public static void MonitorCluster()
{
    var client = new AmbariClient(Con
led with code 400. Cluster leftbehind state: Specified Cluster password is invalid. Ensure password is 10 characters long and has at least one number, one uppercase and one special character, spaces not allowed. Message: NULL

To delete the newly provisioned cluster, you can use the Remove-AzureHDInsightCluster command, as shown here:

Remove-AzureHDInsightCluster AutomatedHDI -SubscriptionId $subid -Certificate $cert

Table 4-2 summarizes the commands available in the HDInsight cmdlet set and provides a brief overview of their functions.

Table 4-2. HDInsight cmdlet commands

Add-AzureHDInsightMetastore: Customize the Hive/Oozie metadata storage location.
Add-AzureHDInsightStorage: Add a new storage account to the subscription.
New-AzureHDInsightCluster: Provision a new HDInsight cluster.
New-AzureHDInsightConfig: Used to parameterize HDInsight cluster properties, like number of nodes, based on configured values.
Remove-AzureHDInsightCluster: Delete an HDInsight cluster.
Get-AzureHDInsightCluster: List the provisioned HDInsight clusters for the subscription.
Set-AzureHDInsightDefaultStorage: Set the default storage account for HDInsight cluster creations.

PowerShell cmdlets give you the flexibility of really taking advantage of the elasticity of services that Azure HDInsight provides. You can create a PowerShell script that will spin up your Hadoop cluster when required
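A sketch of such an elastic workflow, stitched together from the cmdlets in Table 4-2 (names and parameters are illustrative; storage is configured separately so the data outlives the cluster):

# Spin up a transient cluster, run the work, then tear the cluster down.
$config = New-AzureHDInsightConfig -ClusterSizeInNodes 4 |
    Set-AzureHDInsightDefaultStorage -StorageAccountName "mystorageaccount.blob.core.windows.net" -StorageAccountKey $key -StorageContainerName "mycontainer"

New-AzureHDInsightCluster -Name "transientcluster" -Config $config -Location "East US" -Credential (Get-Credential) -Subscription $subid

# ... submit MapReduce or Hive jobs here (see Chapter 5) ...

# Delete the cluster; the data stays behind in the storage account.
Remove-AzureHDInsightCluster "transientcluster" -Subscription $subid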
liTrace, you can switch to the IntelliTrace Calls View and see the function calls, as shown in Figure 12-6.

[Figure 12-6 shows the IntelliTrace Calls View for the main thread, with the call tree running from System.Threading.ThreadHelper.ThreadStart through HadoopClient.Program.Main and HadoopClient.Program.CreateCluster into the anonymous method that matches X509Certificate2 items by thumbprint.]
llow SQL PDW to send queries to Hadoop and fetch data results. The nice thing is that users can send regular SQL queries to PDW, and Hadoop can run them and fetch data from unstructured files. To learn more about PDW and Polybase, see the following MSDN page:

http://www.microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/polybase.aspx

The open source Apache Hadoop project is going through a lot of changes as well. In the near future, Hadoop version 2.0 will be widely available. Hadoop 2.0 introduces a new concept called Yet Another Resource Negotiator (YARN) on top of traditional MapReduce. This is also known as MapReduce 2.0, or MRv2. With HDInsight internally using Hadoop, it is highly likely that the Azure Service and the Emulator will be upgraded to Hadoop 2.0 as well, in due course. The underlying architecture, however, will be the same in terms of job submissions and end-user interactions; hence, the impact of this change on readers and users will be minimal.

Summary

The HDInsight offering is essentially a cloud service from Microsoft. Since even evaluating the Windows Azure HDInsight Service involves some cost, an emulator is available as a single-node box product for your Windows Server systems, which you can use as your playground to test and evaluate the technology. The Windows Azure HDInsight Emulator uses the same software bits as the Azure Service and supports the exact same set of functionality. It is designed to be scala
log file names are similar to the corresponding Windows service names that you see in the figure. For example, the Apache Hadoop namenode service will log its operations to the namenode trace log file, and so on.

[Figure 11-1 shows the Windows services list: Apache Hadoop derbyserver, hiveserver, hiveserver2, isotopejs, jobtracker, metastore, namenode, oozieservice, and templeton.]

Figure 11-1. HDInsight services

These logs record the messages and failures during service startup, if there are any. They also record the ID number of the process spawned when a service starts. Following is a sample namenode trace log file. It shows the content after a name node service startup:

HadoopServiceTraceSource Information: 0 : Tracing successfully initialized
DateTime=2013-12-10T02:46:57.6055000Z
Timestamp=3981555628
HadoopServiceTraceSource Information: 0 : Loading service xml: c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\bin\namenode.xml
DateTime=2013-12-10T02:46:57.6055000Z
Timestamp=3981598144
HadoopServiceTraceSource Information: 0 : Successfully parsed service xml for service namenode
DateTime=2013-12-10T02:46:57.6211250Z
Timestamp=3981610465
HadoopServiceTraceSource Information: 0 : Command line: c:\apps\dist\java\bin\java -server -Xmx4096m -Dhadoop.log.dir=c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\lo
lt cluster version used by Windows Azure HDInsight Service is 2.1. It is based on Hortonworks Data Platform version 1.3.0. It provides Hadoop services with the component versions summarized in Table 2-1.

Table 2-1. Hadoop components in HDInsight 2.1

Apache Hadoop: 1.2.0
Apache Hive: 0.11.0
Apache Pig: 0.11
Apache Sqoop: 1.4.3
Apache Oozie: 3.2.2
Apache HCatalog: Merged with Hive
Apache Templeton: Merged with Hive
Ambari: API v1.0

Cluster Version 1.6

Windows Azure HDInsight Service 1.6 is another cluster version that is available. It is based on Hortonworks Data Platform version 1.1.0. It provides Hadoop services with the component versions summarized in Table 2-2.

Table 2-2. Hadoop components in HDInsight 1.6

Apache Hadoop: 1.0.3
Apache Hive: 0.9.0
Apache Pig: 0.9.3
Apache Sqoop: 1.4.2
Apache Oozie: 3.2.0
Apache HCatalog: 0.4.1
Apache Templeton: 0.1.4
SQL Server JDBC Driver: 3.0

Note: Both versions of the cluster ship with stable components of HDP and the underlying Hadoop ecosystem. However, I recommend the latest version, which is 2.1 as of this writing. The latest version will have the latest enhancements and updates from the open source community. It will also have fixes to bugs that were reported against previous versions. For those reasons, my preference is to run on the latest available version, unless there is some specific reason to do otherwise by running some older version.

The component versions associated with HDInsight cluster versions may change in future updates to HDInsight. One way to determine the available components and their versions is to log in to a cluster using Remote Desktop, go directly to the cluster's name node, and then examine the contents of the C:\apps\dist directory.

Storage Location Options

When you create a Hadoop cluster on Azure, you should understand the different storage mechanisms. Windows Azure has three types of storage available: blob, table, and queue.

Blob storage: Binary Large Objects (blobs) should be familiar to most developers. Blob storage is used to store things like images, documents, or videos: something larger than a first name or an address. Blob storage is organized by containers that can hold two types of blobs, Block and Page. The type of blob needed depends on its usage and size. Block blobs are limited to 200 GB, while Page blobs can go up to 1 TB. Blob storage can be accessed via REST APIs, with a URL such as http://debarchans.blob.core.windows.net/MyBLOBStore.

Table storage: Azure tables should not be confused with tables from an RDBMS like SQL Server. They are composed of a collection of entities and properties, with properties further containing collections of name, type, and value. One thing I particularly don't like as a developer is that Azure tables can't be accessed using ADO.NET methods
lumn mappings between the source and the destination are correct, as shown in Figure 10-17. Click on OK to complete the configuration.

[Figure 10-17 shows the OLE DB Destination Editor mapping the available input columns (stock_symbol, stock_date, stock_price_open, stock_price_high, stock_price_low, stock_price_close, stock_volume, stock_price_adj_close, exchange) to identically named destination columns.]

Figure 10-17. Verifying the column mappings

Caution: If you choose to create the target table yourself and specify different column names than the source, you have to manually map each of these source and destination columns. SSIS's inbuilt column-mapping intelligence is based on having the same column names, so if they differ, make sure you set up the column mappings correctly.

The data flow, with the source and destination along with the connection
provisioning based on C# code and PowerShell scripts.

Chapter 5, "Submitting Jobs to Your HDInsight Cluster," shows you ways to submit MapReduce jobs to your HDInsight cluster. You can leverage the same .NET- and PowerShell-based framework to submit your data processing operations and retrieve the output. This chapter also teaches you how to create a MapReduce job in .NET. Again, this is unique to HDInsight, as traditional Hadoop jobs are based on Java only.

Chapter 6, "Exploring the HDInsight Name Node," discusses the Azure virtual machine that acts as your cluster's Name Node when you create a cluster. You can log in remotely to the Name Node and execute command-based Hadoop jobs manually. This chapter also speaks about the web applications that are available by default to monitor cluster health and job status when you install Hadoop.

Chapter 7, "Using the Windows Azure HDInsight Emulator," introduces you to the local one-box emulator for your Azure service. This emulator is primarily intended to be a test bed for testing or evaluating the product and your solution before you actually roll it out to Azure. You can simulate both the HDInsight cluster and Azure storage, so that you can evaluate it absolutely free of cost. This chapter teaches you how to install the emulator, set the configuration options, and test-run MapReduce jobs on it using the same techniques.

Chapter 8, "Accessing HDInsight over Hive and ODBC," talks about
249. mber of rows downloaded from your blob storage, and even load the data into a data model, as shown in Figure 9-27.

[the Query Settings pane, with the Filter & Shape, Enable download, and Load to data model options shown beside a preview of the AAPL rows with Open Price, High Price, Low Price, and volume columns]

Figure 9-27. Importing data using Power Query

Note: Power Query can directly fetch HDFS data and place it in your Excel works
250. me specific reason to do otherwise by running some older version.

The component versions associated with HDInsight cluster versions may change in future updates to HDInsight. One way to determine the available components and their versions is to log in to a cluster using Remote Desktop, go directly to the cluster's name node, and then examine the contents of the C:\apps\dist directory.

Storage Location Options

When you create a Hadoop cluster on Azure, you should understand the different storage mechanisms. Windows Azure has three types of storage available: blob, table, and queue.

- Blob storage: Binary Large Objects (blobs) should be familiar to most developers. Blob storage is used to store things like images, documents, or videos; something larger than a first name or address. Blob storage is organized by containers that can hold two types of blobs: block and page. The type of blob needed depends on its usage and size. Block blobs are limited to 200 GB, while page blobs can go up to 1 TB. Blob storage can be accessed via REST APIs with a URL such as http://debarchans.blob.core.windows.net/MyBLOBStore.

- Table storage: Azure tables should not be confused with tables from an RDBMS like SQL Server. They are composed of a collection of entities and properties, with properties further containing collections of name, type, and value. One thing I particularly don't like as a developer is that Azure tables can't be accessed using ADO.NET metho
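To make the blob container idea concrete, here is a minimal sketch, not a listing from the book, that uploads a local file to blob storage through the .NET storage client library (the same library the book's samples use for WASB access). The account name, key, container, and file paths are placeholders, and the container is assumed to already exist.

using System.IO;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

// Parse the same connection-string format the book uses elsewhere.
CloudStorageAccount account = CloudStorageAccount.Parse(
    "DefaultEndpointsProtocol=https;AccountName=debarchans;AccountKey=<your_storage_key>");
CloudBlobClient blobClient = account.CreateCloudBlobClient();

// Containers organize blobs; "myblobstore" is a hypothetical, pre-created container.
CloudBlobContainer container = blobClient.GetContainerReference("myblobstore");

// Upload a local file as a block blob via the REST API the client wraps.
CloudBlockBlob blob = container.GetBlockBlobReference("sample.txt");
using (FileStream fs = File.OpenRead(@"C:\temp\sample.txt"))
{
    blob.UploadFromStream(fs);
}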
251. min [the remaining rows of the job-configuration listing: TempletonControllerJob joblauncher files and their job conf.xml files for jobs job_201311240635_0002 through _0004, each with size, replication factor of 3, 256 MB block size, a 2013-11-24 timestamp, and admin ownership]

Figure 6-14. The job configurations

The Name Node Status Portal is a part of the Apache Hadoop project, making it familiar to existing Hadoop users. The main advantage of the portal is that it lets you browse through the file system as if it were a local file system. That's an advantage because there is no way to access the file system through standard tools like Windows Explorer, as the entire storage mechanism is abstracted in WASB.

The TaskTracker Portal

Apart from the Name Node and MapReduce status portals, there is also a TaskTracker web interface that is available only on the data nodes (or task nodes) of your cluster. This portal listens on port 50060, and the complete URL to launch it is http://<DataNode_IP_Address>:50060/tasktracker.jsp. Although there is a single TaskTracker per slave node, each TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in parall
252. ml, 220
    Hadoop JobTracker log, 224-225
    jobtracker trace log, 222
    mapred-site.xml, 222
    spilling, 226
    status portal, 91, 104
    types, 219
Microsoft HDInsight
    Apache Hadoop ecosystem
        cluster components, 5
        Hadoop Distributed File System, 5
        MapReduce, 5
        purposes/features, 5
    big data
        and relationships, 2
        difference of, 3
        end-to-end platform, 3
        implementation factors, 4
        PDW, 4
        queries, 3
        questions of, 2
        right solution, 3
        three V's problem, 1
    combination with business analytics
        of data, 10
        data collection, 10
        data sources, 11
        enterprise BI solution, 10-11
        models of, 9
    Hadoop-based distribution, 1
    Hadoop on Windows
        Hadoop clusters, 7
        IaaS, 7
        Microsoft data platform, 8
        Windows Azure HDInsight Emulator, 7
        Windows Azure HDInsight Service, 7
MRRunner
    HadoopJob
        double hyphen, 86
        implementation, 85
        MRLib, 86
    HDInsight distribution, 85
    output, 86
    Windows batch file, 87

N
Name Node status portal, 91, 106

O
Open Source Apache project, 75
Override, 62

P, Q
Parallel Data Warehouse (PDW), 4, 125
PARTITIONED BY clause, 129
Pig jobs failures
    EXPLAIN command, 235
    file configuration, 234
    ILLUSTRATE command, 238
    stack trace, 235
Platform as a Service (PaaS), 13
port number, 143
Power Business Intelligence
    futures, 163
    map, 163
    query, 163
        Azure HDInsight, 164
        cluster storage, 164
        filtering csv files, 165
        formatting data, 165
        query editor screen, 165
        uses, 166
PowerPivot enhancements
    AdventureW
253. money and resources, especially if that company is merely trying to evaluate a Big Data solution, or if they are unsure of the value that a Big Data solution may bring to the business. Microsoft offers the Windows Azure HDInsight Service as part of an Infrastructure as a Service (IaaS) cloud offering. This arrangement relieves businesses from setting up and maintaining the Big Data infrastructure on their own, so they can focus more on business-specific solutions that execute on the Microsoft cloud data centers. This chapter will provide insight into the various Microsoft cloud offerings and the Windows Azure HDInsight Service.

Microsoft's Cloud Computing Platform

Windows Azure is an enterprise-class cloud computing platform that supports both Platform as a Service (PaaS), to eliminate complexity, and IaaS, for flexibility. IaaS is essentially about getting virtual machines that you must then configure and manage, just as you would any hardware that you owned yourself. PaaS essentially gives you preconfigured machines; really, not even machines, but a preconfigured platform with Windows Azure and all the related elements in place and ready for you to use. Thus, PaaS is less work to configure, and you can get started faster and more easily. Use PaaS where you can, and IaaS where you need to. With Windows Azure, you can use PaaS and IaaS together and independently; you can't do that with other vendors. Windows Azure integrates with what you have in
254. must have the entry of the Azure storage account and the account key to access the Azure blobs and function correctly. Here is the snippet of our cluster's core-site.xml, which uses the democluster blob as its cluster storage:

<property>
  <name>fs.azure.account.key.democluster.blob.core.windows.net</name>
  <value>**********</value>
</property>

So the output folder and the file you just created are actually in your blob container for democluster. To confirm this, you can go to your Windows Azure Management Portal and see the blobs you just created as part of your cluster's data, as shown in Figure 6-9.

[the container listing, with blobs such as example/data/WordCountOutputPS/part-r-00000, example/data/commandlineoutput/part-r-00000, and example/data/gutenberg under http://democluster.blob.core.windows.net]

Figure 6-9. WASB container for democluster

The Hive Console

Hive is an abstraction over HDFS and MapReduce. It enables you to define a table-like schema structure on the underlying HDFS (actually WASB, in HDInsight), and it provides a SQL-like query language to read data from the tables. The Hadoop command line also gives you access to the Hive console, from which you can directly execute the Hive Query Language (HQL) to create, select, join, sort, and perform many other operations on the cluster data. Internally, the HQL queries are broke
255. n [the COM Add-Ins dialog, listing Microsoft Office PowerPivot for Excel 2013, Microsoft Power Query for Excel, Power View, and the Team Foundation add-in; the PowerPivot add-in location is under C:\Program Files\Microsoft Office 15\Root\Office15\ADDINS, with Load Behavior set to Load at Startup]

Figure 9-1. Enabling the Excel add-ins

Note: PowerPivot is also supported in Excel 2010. Power View and Power Query are available only in Excel 2013.

To create a PowerPivot model, open Excel, navigate to the POWERPIVOT ribbon, and click on Manage, as shown in Figure 9-2.

[the Excel ribbon, with the Manage icon in the Data Model group of the POWERPIVOT tab]

Figure 9-2. PowerPivot for Excel 2013

Clicking on the Manage icon will bring up the PowerPivot for Excel window, where you need to configure the connection to Hive. Click on Get External Data and select From Other Sources, as shown in Figure 9-3.

[the PowerPivot for Excel window, with the Get External Data menu open on the Home ribbon and an existing table tab visible]
256. n

Big Data is the new buzzword for a data-mining technology that has been around for quite some time. Data analysts and business managers are fast adopting techniques like predictive analysis, recommendation services, clickstream analysis, and so on; techniques that were commonly at the core of data processing in the past, but which have been ignored or lost in the rush to implement modern relational database systems and structured data storage.

Big Data encompasses a range of technologies and techniques that allow you to extract useful, previously hidden information from large quantities of data that might otherwise have been left dormant and, ultimately, thrown away because storage for it was too costly. Big Data solutions aim to provide data storage and querying functionality for situations that are, for various reasons, beyond the capabilities of traditional database systems. For example, analyzing social media sentiment for a brand has become a key parameter for judging a brand's success. Big Data solutions provide a mechanism for organizations to extract meaningful, useful, and often vital information from the vast stores of data that they are collecting.

Big Data is often described as a solution to the "three V's" problem:

Variety: It's common for 85 percent of your new data to not match any existing data schema. Not only that, it might very well be semi-structured or even unstructured data. This means that applying schemas to the data before or dur
257. n down to MapReduce jobs that execute and generate the desired output that is returned to the user. To launch the Hive console, navigate to the c:\apps\dist\hive-0.11.0.1.3.1.0-06\bin folder from the Hadoop command line and execute the hive command. This should start the Hive command prompt, as shown in Listing 6-6.

Listing 6-6. The Hive console

c:\apps\dist\hive-0.11.0.1.3.1.0-06\bin>hive
Logging initialized using configuration in file C:\apps\dist\hive-0.11.0.1.3.1.0-06\conf\hive-log4j.properties
hive>

If you run the show tables command, it will show you output similar to what you saw when you ran your Hive job from the .NET program in Chapter 5, as in Listing 6-7.

Listing 6-7. The show tables command

hive> show tables;
OK
aaplstockdata
hivesampletable
stock_analysis
stock_analysis1
Time taken: 3.182 seconds, Fetched: 4 row(s)

You can create new tables, populate them based on the data files in your blob containers in different partitions, and query them based on different criteria directly from the Hive console. However, using the .NET SDK and PowerShell are the recommended ways of making Hive job submissions in HDInsight, rather than running them interactively from the console.

Note: Details of Hive operations are covered in Chapter 8 of this book.

The Sqoop Console

Sqoop is an open source Apache project that facilitates bidirectional data exchange between Hado
258. n managers, should look like Figure 10-18.

[the data flow, with the ADO NET Source connected to the OLE DB Destination and the Hive Connection and SQL Connection listed in the Connection Managers tray]

Figure 10-18. The complete data flow

Running the Package

Voila! You are all set to go. From the menu bar, select Debug > Start Debugging, press F5, or click the Play button in the toolbar to execute the package, as shown in Figure 10-19.

[the Visual Studio menu bar and toolbar, with the Play button highlighted]

Figure 10-19. Executing the package

The package should run successfully, transfer records from the Hive table to the SQL Server table, and display the total number of records imported, as shown in Figure 10-20.

[the executed data flow, showing 36,153 rows moving from the ADO NET Source to the OLE DB Destination]

Figure 10-20. Successful package execution

If you are running this package on a 64-bit Windows operating system, you need to change the Run64BitRuntime property to False in order to execute the package, as it is using the 32-bit Hive ODBC driver. This can be done from the Project Properties > Configuration Properties > Debugging tab, as shown in Figure 10-21.
259. n(string[] args)
{
    ListClusters();
    CreateCluster();
    DeleteCluster();
    Console.ReadKey();
}

// List existing HDI clusters
public static void ListClusters()
{
    var store = new X509Store();
    store.Open(OpenFlags.ReadOnly);
    var cert = store.Certificates.Cast<X509Certificate2>()
        .First(item => item.Thumbprint == Constants.thumbprint);
    var creds = new HDInsightCertificateCredential(Constants.subscriptionId, cert);
    var client = HDInsightClient.Connect(creds);
    var clusters = client.ListClusters();
    foreach (var item in clusters)
    {
        Console.WriteLine("Cluster: {0}, Nodes: {1}", item.Name, item.ClusterSizeInNodes);
    }
}

public static void CreateCluster()
{
    var store = new X509Store();
    store.Open(OpenFlags.ReadOnly);
    var cert = store.Certificates.Cast<X509Certificate2>()
        .First(item => item.Thumbprint == Constants.thumbprint);
    var creds = new HDInsightCertificateCredential(Constants.subscriptionId, cert);
    var client = HDInsightClient.Connect(creds);

    // Cluster information
    var clusterInfo = new ClusterCreateParameters()
    {
        Name = "AutomatedHDICluster",
        Location = "East US",
        DefaultStorageAccountName = Constants.storageAccount,
        DefaultStorageAccountKey = Constants.storageAccountKey,
        DefaultStorageContainer = Constants.container,
        UserName = Constants.clusterUser,
        Password = Constants.clusterPassword,
        ClusterSizeInNodes = 2
    };
    var clusterDetails = client.CreateCluster(clusterInfo);
    Console.Wr
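// The DeleteCluster() helper called from Main() is not shown in this fragment.
// A minimal sketch, assuming the HDInsightClient API used above also exposes a
// DeleteCluster(name) method (verify the exact signature against your SDK
// version), might look like the following:
public static void DeleteCluster()
{
    var store = new X509Store();
    store.Open(OpenFlags.ReadOnly);
    var cert = store.Certificates.Cast<X509Certificate2>()
        .First(item => item.Thumbprint == Constants.thumbprint);
    var creds = new HDInsightCertificateCredential(Constants.subscriptionId, cert);
    var client = HDInsightClient.Connect(creds);

    // Deleting the cluster does not delete the data in the associated
    // storage account; the WASB content is retained.
    client.DeleteCluster("AutomatedHDICluster");
    Console.WriteLine("Cluster deleted");
}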
260. n the next chapter, you will learn about troubleshooting the different types of job-submission failures in HDInsight.

CHAPTER 13

Troubleshooting Job Failures

There are different types of jobs you can submit to your HDInsight cluster, and it is inevitable that you will run into problems every now and then while doing so. Though most HDInsight jobs are internally executed as MapReduce jobs, there are different techniques for troubleshooting the high-level supporting projects, like Hive, Pig, Oozie, and others, that make life easier for the developer. In this chapter, you will learn to troubleshoot the following types of failures:

- MapReduce job failures
- Hive job failures
- Pig job failures
- Sqoop job failures
- Windows Azure Storage Blob failures
- Cluster connectivity failures

MapReduce Jobs

All MapReduce job activities are logged by default in Hadoop in the C:\apps\dist\hadoop-1.2.0.1.3.1.0-06\logs directory of the name node. The log file name is of the format HADOOP-jobtracker-<hostname>.log. The most recent data is in the log file; older logs have their date appended to them. On each of the data nodes (or task nodes), you will also find a subdirectory named userlogs inside the C:\apps\dist\hadoop-1.2.0.1.3.1.0-06\logs folder. This directory will have another subdirectory for every MapReduce task running in your Hadoop cluster. Each task records its stdout output and stderr error to two files in this subdirectory.
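Because each task writes its stdout and stderr to its own files under userlogs, a small program run on the node can inventory them for a given job. The following is a minimal sketch, not from the book; the job ID is hypothetical, and the Hadoop version segment in the path may differ on your cluster.

using System;
using System.IO;

class TaskLogInventory
{
    static void Main()
    {
        // The userlogs directory described above, on the node itself.
        string userlogs = @"C:\apps\dist\hadoop-1.2.0.1.3.1.0-06\logs\userlogs";

        // Hypothetical job ID; every task of the job gets its own subdirectory.
        foreach (string jobDir in Directory.GetDirectories(userlogs, "job_201311240635_0002*"))
        {
            Console.WriteLine(Path.GetFileName(jobDir));

            // stdout and stderr are recorded as separate files per task attempt.
            foreach (string log in Directory.GetFiles(jobDir, "*", SearchOption.AllDirectories))
                Console.WriteLine("  {0} ({1} bytes)", log, new FileInfo(log).Length);
        }
    }
}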
261. n time:

- fs.azure.selfthrottling.read.factor: used when reading data from WASB
- fs.azure.selfthrottling.write.factor: used when writing data to WASB

Note: Valid values for these settings are in the range 0-1.

Example 1: If your cluster has n = 20 nodes and is primarily doing heavy write operations, you can calculate the appropriate fs.azure.selfthrottling.write.factor value for a storage account with geo-replication on:

fs.azure.selfthrottling.write.factor = 5 Gbps / (800 Mbps * 20) = 0.32

Example 2: If your cluster has n = 20 nodes and is doing heavy read operations, you can calculate the appropriate fs.azure.selfthrottling.read.factor value for a storage account with geo-replication off:

fs.azure.selfthrottling.read.factor = 15 Gbps / (1,600 Mbps * 20) = 0.48

If you still find that throttling continues after adjusting the parameter values just shown, further analysis and adjustment may be necessary.

Connectivity Failures

There are a few ways you can connect to your cluster. You can use a remote desktop login to connect to the head node, you can use the ODBC endpoint on port 443 to connect to the Hive service, and you can navigate through the REST-based protocols to different URLs from Internet Explorer. Always make sure to test these different types of connections when you encounter a specific problem. For example, if you are unable to remotely log in to one of your data no
262. nagement.HDInsight.Cmdlets.zip

Figure 4-9. HDInsight management cmdlets

Note: This step of installing the cmdlets won't be needed in the future, when the HDInsight cmdlets are integrated and installed as part of Windows Azure PowerShell version 0.7.2. This book is based on Windows Azure PowerShell version 0.7.1, which does require this installation step.

Launch the Windows Azure PowerShell command prompt and load the HDInsight cmdlets by executing the following command:

Import-Module D:\HadoopClient\Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.dll

This will load the required set of HDInsight cmdlets in PowerShell:

PS C:\> Import-Module D:\HadoopClient\Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.dll
VERBOSE: Loading module from path 'D:\HadoopClient\Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.dll'.
VERBOSE: Importing cmdlet 'Add-AzureHDInsightMetastore'.
VERBOSE: Importing cmdlet 'Add-AzureHDInsightStorage'.
VERBOSE: Importing cmdlet 'New-AzureHDInsightCluster'.
VERBOSE: Importing cmdlet 'New-AzureHDInsightConfig'.
VERBOSE: Importing cmdlet 'Remove-AzureHDInsightCluster'.
VERBOSE: Importing cmdlet 'Get-AzureHDInsightCluster'.
VERBOSE: Importing cmdlet 'Set-AzureHDInsightDefaultStorage'.

Note: The path of the Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.dll file might vary depending on where you choose to download it.
263. name from any client application connection. A DSN essentially creates an alias for your data source; you can change where the DSN is pointing, and your applications will continue to work. However, the downside to this approach is that you'll need to make sure the DSN exists on all machines that will be running your applications.

The alternate way of establishing a connection without a DSN is to use a connection string in your application. The advantage of using a connection string is that you don't have to pre-create the DSN on the systems that will be running your application. The connection string parameters can be a little tricky, but this is a preferred approach because it removes the external DSN dependency. Also note that the same connection string works for both 32-bit and 64-bit execution modes, so you can avoid creating multiple DSNs; you just need to ensure that both versions of the ODBC driver are installed. Table 8-1 summarizes the connection string attributes you need to set to create a DSN-less connection to Hive using the Microsoft Hive ODBC driver.

Table 8-1. Connection string attributes for Hive

Field: Description
Driver: Name of the driver, Microsoft Hive ODBC Driver
Host: DNS hostname of your cluster
Port: Connection port. For the Azure HDInsight Service it is 443, and for the Azure HDInsight Emulator it is 10001
Schema: Default database schema
RowsFetchedPerBlock: Number of rows fetc
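As an illustration, the following is a minimal C# sketch, not from the book, of a DSN-less connection built from the attributes in Table 8-1. The host, credentials, and fetch size are placeholders, and depending on the driver version you may need additional attributes (for example, the Hive server type and authentication mechanism shown in the DSN setup dialog later in this chapter).

using System;
using System.Data.Odbc;

class HiveConnectionStringDemo
{
    static void Main()
    {
        // DSN-less connection string assembled from the Table 8-1 attributes.
        string connStr =
            "Driver={Microsoft Hive ODBC Driver};" +
            "Host=democluster.azurehdinsight.net;" +
            "Port=443;" +
            "Schema=default;" +
            "RowsFetchedPerBlock=10000;" +   // hypothetical fetch size
            "UID=admin;PWD=<your_password>;";

        using (var conn = new OdbcConnection(connStr))
        {
            conn.Open();
            using (var cmd = new OdbcCommand("show tables", conn))
            using (OdbcDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine(reader.GetString(0));
            }
        }
    }
}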
264. nd Azure HDInsight Emulator using the Microsoft Hive ODBC driver. You also learned to create a DSN-less connection to HDInsight for client applications to connect using a connection string.

CHAPTER 9

Consuming HDInsight from Self-Service BI Tools

Self-service Business Intelligence (BI) is the talk of the town at the moment. As the term suggests, self-service BI is a concept through which you can perform basic data analysis and extract intelligence out of that data with easy-to-use tools, without needing to hire a suite of BI experts or implement a data warehouse solution. Self-service BI is certainly a trend toward the consumerization of IT and BI. The trend is that an individual, or even a really small-scale and growing company, can afford BI to implement a better decision-making process. This chapter will focus on the various self-service BI tools available from Microsoft that provide strong integration with HDInsight and help in the following analytics and reporting processes:

- PowerPivot
- Power View
- Power BI

PowerPivot Enhancements

With SQL Server 2012, Microsoft has enhanced the data analysis capabilities of PowerPivot, for both the client-side component (PowerPivot for Excel) and the server-side component (PowerPivot for SharePoint), to provide enhanced self-service BI functionality to all Microsoft Office users. The new enhancements in PowerPivot help users integrate data from multiple sources more easily and create reports
265. nd PowerShell.

Using the Hadoop .NET SDK

Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer. This is essentially a Hadoop API to MapReduce that allows you to write map and reduce functions in languages other than Java: .NET, Perl, Python, and so on. Hadoop streaming uses Windows streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program. This functionality makes streaming naturally suited for text processing. In this chapter, I focus only on .NET to leverage Hadoop streaming.

The mapper and reducer parameters are .NET types that derive from the base Map and Reduce abstract classes. The input, output, and files options are analogous to the standard Hadoop streaming submissions. The mapper and reducer allow you to define a .NET type derived from the appropriate abstract base classes. The objective in defining these base classes was not only to support creating .NET Mapper and Reducer classes, but also to provide a means for Setup and Cleanup operations, to support in-place Mapper/Combiner/Reducer optimizations, to utilize IEnumerable and sequences for publishing data from all classes, and, finally, to provide a simple submission mechanism analogous to submitting Java-based jobs. A sketch of such mapper and reducer types follows below.

The basic logic behind MapReduce is that
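The following is a minimal word-count sketch, not a listing from the book, showing the shape of these types; it assumes the Microsoft.Hadoop.MapReduce NuGet package, whose MapperBase and ReducerCombinerBase classes provide the Map and Reduce overrides and the EmitKeyValue helper used here.

using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;

// Mapper: emit each word in the input line with a count of 1.
public class WordCountMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        foreach (string word in inputLine.Split(' ', '\t'))
        {
            if (word.Length > 0)
                context.EmitKeyValue(word, "1");
        }
    }
}

// Reducer (also usable in place as a combiner): sum the counts per word.
public class WordCountReducer : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values,
                                ReducerCombinerContext context)
    {
        context.EmitKeyValue(key, values.Sum(v => long.Parse(v)).ToString());
    }
}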
266. nd to end up with problems. Whether the problems are related to the manual or programmatic deployment of clusters or to submitting your MapReduce jobs, troubleshooting is the art of logically removing the roadblocks that stand between you and your Big Data solution. This chapter will focus specifically on common cluster-deployment failure scenarios and ways to investigate them.

Cluster Creation

As you saw in Chapter 3, creating a cluster using either Quick Create or Custom Create involves a sequence of operations that need to be completed successfully to make the cluster operational. The phases are marked by the status shown at each stage:

- Submitting
- Accepted
- Windows Azure VM Configuration
- HDInsight Configuration
- Running

Table 12-1 explains what goes on behind the scenes during each of these phases.

Table 12-1. Status designations when creating HDInsight clusters

Status: What it means
Submitting: The communication in this step is between the Azure portal and the HDInsight Deployment Service, which is a REST API provided by HDInsight in Azure for its internal use in these kinds of operations. If there is a failure here, it is likely a problem with the parameters of the setup or a serious failure of the Deployment Service.
Accepted: The HDInsight Deployment Service orchestrates the actions from this point forward, communicating status back to the Azure portal. A hidden Cloud Service is provisioned as a container, and then Clust
267. ndows Azure PowerShell command prompt. Save the script file as SubmitJob.ps1 in a location of your choice and execute it from the PowerShell prompt. You should see output similar to the following once the script completes successfully:

PS C:\> C:\SubmitJob.ps1

StatusDirectory : 0fac8406-891d-41ff-af74-eaac21386fd3
ExitCode        : 0
Name            : wordcount
Query           : d
State           : Completed
SubmissionTime  : 12/9/2013 7:47:05 PM
Cluster         : democluster
PercentComplete : map 100% reduce 100%
JobId           : job_201311240635_0192

13/12/09 19:47:19 INFO input.FileInputFormat: Total input paths to process : 1
13/12/09 19:47:19 WARN snappy.LoadSnappy: Snappy native library is available
13/12/09 19:47:19 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/12/09 19:47:19 INFO snappy.LoadSnappy: Snappy native library loaded
13/12/09 19:47:19 INFO mapred.JobClient: Running job: job_201311240635_0193
13/12/09 19:47:20 INFO mapred.JobClient: map 0% reduce 0%
13/12/09 19:47:29 INFO mapred.JobClient: map 100% reduce 0%
13/12/09 19:47:37 INFO mapred.JobClient: map 100% reduce 33%
13/12/09 19:47:39 INFO mapred.JobClient: map 100% reduce 100%
13/12/09 19:47:41 INFO mapred.JobClient: Job com
268. nge='NASDAQ');
OK
Time taken: 44.327 seconds

Repeat the preceding steps for all the .csv files you have to load into the table. Note that you need to replace only the .csv file names in querystring, and make sure you load the data into the respective partitions of the Hive table. Listing 8-7 gives you all the LOAD commands for each of the .csv files.

Listing 8-7. The LOAD commands

querystring = load data inpath 'wasb://democlustercontainer@democluster.blob.core.windows.net/debarchan/StockData/tableFacebook.csv' into table stock_analysis partition (exchange='NASDAQ');

querystring = load data inpath 'wasb://democlustercontainer@democluster.blob.core.windows.net/debarchan/StockData/tableApple.csv' into table stock_analysis partition (exchange='NASDAQ');

querystring = load data inpath 'wasb://democlustercontainer@democluster.blob.core.windows.net/debarchan/StockData/tableGoogle.csv' into table stock_analysis partition (exchange='NASDAQ');

querystring = load data inpath 'wasb://democlustercontainer@democluster.blob.core.windows.net/debarchan/StockData/tableIBM.csv' into table stock_analysis partition (exchange='NYSE');

querystring = load data inpath 'wasb://democlustercontainer@democluster.blob.core.windows.net/debarchan/StockData/tableOracle.csv' into table stock_analysis partition (exchange='NYSE');

Querying Tables with HiveQL

After you create tables and load data files into the appropriate locations, you can start to q
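Such queries can also be submitted from C# with the same HiveJobCreateParameters pattern used elsewhere in this book. The following is a minimal sketch, not a book listing; the status folder name and the query are illustrative only, and the certificate plumbing mirrors the book's other samples.

HiveJobCreateParameters queryJob = new HiveJobCreateParameters()
{
    JobName = "Query stock_analysis",
    StatusFolder = "/StockQueryFolder",
    Query = "select stock_symbol, max(stock_price_high) from stock_analysis " +
            "where exchange = 'NASDAQ' group by stock_symbol;"
};

// Certificate lookup and credentials, as in the book's other job submissions.
var store = new X509Store();
store.Open(OpenFlags.ReadOnly);
var cert = store.Certificates.Cast<X509Certificate2>()
    .First(item => item.Thumbprint == Constants.thumbprint);
var creds = new JobSubmissionCertificateCredential(Constants.subscriptionId, cert, Constants.clusterName);

var jobClient = JobSubmissionClientFactory.Connect(creds);
JobCreationResults results = jobClient.CreateHiveJob(queryJob);
Console.WriteLine("Hive job submitted: {0}", results.JobId);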
269. nistrator. [the ODBC Data Source Administrator on the System DSN tab, with its User DSN, System DSN, File DSN, Drivers, Tracing, Connection Pooling, and About tabs, the System Data Sources list, and the Add, Remove, and Configure buttons]

Figure 8-5. Add System DSN

Choose the Microsoft Hive ODBC Driver in the next screen of the Create New Data Source wizard, as shown in Figure 8-6.

[the driver-selection list, offering Microsoft Hive ODBC Driver 1.00.00.00, SQL Server 6.01.7601.17514, and SQL Server Native Client 11.0.2011.110.3000.00]

Figure 8-6. Selecting the Microsoft Hive ODBC Driver

After clicking Finish, you are presented with the final Microsoft Hive ODBC Driver DSN Setup screen, where you'll need to provide the following:

- Host: This is the full domain name of your HDInsight cluster, democluster.azurehdinsight.net
- Port: 443 is the default
- Database: default
- Hive Server Type: Hive Server 2
- Authentication Mechanism: Select Windows Azure HDInsight Service
- User Name & Password: This will be the user name and password you used while creating your cluster

Enter the HDInsight cluster details as well as the credentials used to connect to the cluster. In this sample, I am using my HDInsight cluster deployed in Windows Azure, as shown in Figure 8-7.

[the Microsoft Hive ODBC Driver DSN Setup dialog, with the Data Source Name field, the description "DSN to Connect to Hive on HDInsight," and Host set to democluster]
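Once the DSN is saved, client code refers to it purely by name. The following is a minimal C# sketch, not from the book, with a hypothetical DSN name standing in for whatever you entered as the Data Source Name above.

using System;
using System.Data.Odbc;

// Connect through a pre-created System DSN; "HiveDSN" is a hypothetical name.
using (var conn = new OdbcConnection("DSN=HiveDSN;UID=admin;PWD=<your_password>"))
{
    conn.Open();
    Console.WriteLine("Connected to Hive via DSN");
}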
270. .................... 219
MapReduce Jobs .................... 219
    Configuration Files .................... 220
    .................... 222
    .................... 225
    Concatenate ... .................... 226
    .................... 226
Hive Jobs .................... 226
    .................... 227
    Compress Intermediate ... .................... 232
    Configure ... .................... 233
    Implement Map ... .................... 233
Pig Jobs .................... 234
    Configuration Files .................... 234
    Stack Trace .................... 235
    Explain Command .................... 235
    Illustrate Command .................... 238
Sqoop Jobs .................... 238
Windows Azure Storage Blob .................... 239
    WASB Authentication .................... 239
    Azure Throttling .................... 239
Connectivity Failures .................... 241
Summary .................... 242

About the Author

Debarchan Sarkar (@debarchans) is a Senior Support Engineer on the Microsoft HDInsight team and a technical author of books on SQL Server BI and Big Data. His total tenure at Microsoft is six years, and he was with the SQL Server BI team before diving deep
271. nsight offering Accessing the HDinsight Name Node You have to enable remote connectivity to your name node from the Azure Management portal By default remote login is turned off You can enable it from your cluster s configuration screen as shown in Figure 6 1 89 CHAPTER 6 EXPLORING THE HDINSIGHT NAME NODE It gt lt DELETE ENABLE REMOTE Figure 6 1 Enabling Remote Desktop to access the cluster name node Create the user to be granted remote desktop access to the name node in the Configure Remote Desktop screen as shown in Figure 6 2 Be sure to supply a password You also have to choose an expiration date for this user account For security reasons you will need to reconfigure your remote desktop user every seven days The expiration date that needs to be set will not accept a date greater than a week into the future IGURE HDINSIGHT Configure Remote Desktop USER NAME hadoopuser PASSWORD CONFIRM PASSWORD EXPIRES ON 2013 12 16 Figure 6 2 Configuring a Remote Desktop user Within a minute or two Remote Desktop will be enabled for your cluster You will then see the Connect option as shown in Figure 6 3 x CONNECT DISABLE REMOTE Figure 6 3 Remote Desktop enabled Click on the Connect link Open the Remote Desktop file democluster azurehdinsight net rdp Accept the couple of security prompts you might get Choose not to prompt again You will then get a screen where you need to provide
272. ntegration I strongly believe that the future of IT will be seen in the form of integration and collaboration opening up new dimensions in the industry The world of data has seen exponential growth in volume in the past couple of years With the web integrated in each and every type of device we are generating more digital data every two years than the volume of data generated since the dawn of civilization Learning the techniques to store manage process and most importantly make sense of data is going to be key in the coming decade of data explosion Apache Hadoop is already a leader as a Big Data solution framework based on Java Linux This book is intended for readers who want to get familiar with HDInsight which is Microsoft s implementation of Apache Hadoop on Windows Microsoft HDInsight is currently available as an Azure service Windows Azure HDInsight Service brings in the user friendliness and ease of Windows through its blend of Infrastructure as a Service IaaS and Platform as a Service PaaS Additionally it introduces NET and PowerShell based job creation submission and monitoring frameworks for the developer communities based on Microsoft platforms Intended Audience Pro Microsoft HDInsight is intended for people who are already familiar with Apache Hadoop and its ecosystem of projects Readers are expected to have a basic understanding of Big Data as well as some working knowledge of present day Business Intelligence BI
273. ny springer sbm com or visit www springeronline com Apress Media LLC is a California LLC and the sole member owner is Springer Science Business Media Finance Inc SSBM Finance Inc SSBM Finance Inc is a Delaware corporation For information on translations please e mail rights apress com or visit www apress com Apress and friends of ED books may be purchased in bulk for academic corporate or promotional use eBook versions and licenses are also available for most titles For more information reference our Special Bulk Sales eBook Licensing web page at www apress com bulk sales Any source code or other supplementary material referenced by the author in this text is available to readers at www apress com For detailed information about how to locate your book s source code go to www apress com source code I dedicate my work to my mother Devjani Sarkar All that I am or hope to be I owe to you my Angel Mother You have been my inspiration throughout my life I learned commitment responsibility integrity and all other values of life from you You taught me everything to be strong and focused to fight honestly against every hardship in life I know that I could not be the best son but trust me each day when I wake up I think of you and try to spend the rest of my day to do anything and everything just to see you more happy and proud to be my mother Honestly I never even dreamed of publishing a book some day Your
274. oEnd The entire DoMapReduce method should look similar to Listing 5 9 Listing 5 9 DoMapReduce method public static void DoMapReduce 70 Define the MapReduce job MapReduceJobCreateParameters mrJobDefinition new MapReduceJobCreateParameters JarFile wasb example jars hadoop examples jar ClassName wordcount I mrJobDefinition Arguments Add wasb example data gutenberg davinci txt mrJobDefinition Arguments Add wasb example data WordCountOutput Get certificate var store new X509Store store Open OpenFlags ReadOnly var cert store Certificates Cast lt x509Certificate2 gt First item gt item Thumbprint Constants thumbprint var creds new JobSubmissionCertificateCredential Constants subscriptionId cert Constants clusterName Create a hadoop client to connect to HDInsight var jobClient JobSubmissionClientFactory Connect creds CHAPTER 5 SUBMITTING JOBS TO YOUR HDINSIGHT CLUSTER Run the MapReduce job JobCreationResults mrJobResults jobClient CreateMapReduceJob mrJobDefinition Console Write Executing WordCount MapReduce Job Wait for the job to complete WaitForJobCompletion mrJobResults jobClient Print the MapReduce job output Stream stream new MemoryStream CloudStorageAccount storageAccount CloudStorageAccount Parse DefaultEndpointsProtocol https AccountName _ Constants storageAccount AccountKey
275. of the column stock_date to Date as shown in Figure 9 11 153 CHAPTER 9 CONSUMING HDINSIGHT FROM SELF SERVICE BI TOOLS E EC EES SS Home Design Advanced we fan d ars A Paste Get External Refresh PivotTable F Clear All Sort Ep Data V Date Zo Filters Colum Clipboard Decimal Number Sort and Filter stock_date 5 8 2013 Leen j stock_price_high Blood IBM 5 8 2013 12 0 195 16 195 88 194 IBM 2 8 2013 12 0 195 5 195 5 193 IBM 8 2013 12 0 196 65000000000001 197 16999999999999 195 IBM 31 2013 12 194 49000000000001 196 91 194 IBM 30 2013 12 196 99000000000001 197 83000000000001 195 IBM 29 2013 12 196 83000000000001 197 19 195 IBM 26 2013 12 196 59 197 37 195 Figure 9 11 Changing the stock_date to the Date data type Select the columns from stock_price_open to stock_price_close and set their data type to Decimal as shown in Figure 9 12 RS Home Design Advanced s IS z war gt Autosum L S a E VK Text Ad X Uf it ic Create KPI Paste Get External Refresh PivotTable e Clear All Sort by Find EE Data Date Ba Filters Column Clipboard Y Decimal Number Sort and Filter Find Calculations stock_price_ v 195 5 Whole Number a stock_date HES Currency IBM 5 8 2013 12 0 rna IBM 2 8 2013 12 0 195 5 IBM 1 8 2013 12 0 196 65 194 49 IBM 7 31 2013 12 Figure 9 12 Changing columns to the Decimal data typ
276. ogical Plan The Logical Plan gives you the chain of operators used to build the relations along with data type validation Any filters like NULL checking that might have been applied early on also apply here e Physical Plan The Physical Plan shows how the logical operators are actually translated as physical operators with some memory optimization techniques that might have been used e MapReduce Plan The MapReduce Plan shows how the physical operators are grouped into MapReduce jobs that would actually work on the cluster s data Illustrate Command The ILLUSTRATE command is one of the best ways to debug Pig scripts The command attempts to provide a reader friendly representation of the data ILLUSTRATE works by taking a sample of the output data and running it through the Pig script But as the ILLUSTRATE command encounters operators that remove data such as filter join etc it makes sure that some records pass through the operator and some do not When necessary it will manufacture records that look similar to the data set For example if you have a variable B formed by grouping another variable A the ILLUSTRATE command on variable B will show you the details of the underlying composite types Type in the following command in the Pig shell to check this out A LOAD data AS f1 int f2 int f3 int B GROUP A BY f1 2 ILLUSTRATE B This will give you output similar to what is shown here You can use the ILLUSTRAT
277. ol over what your users decide to log in their MapReduce code but what you do have control over is the task attempt and execution log levels Each of the data nodes have a userlogs folder inside the C apps dist hadoop 1 2 0 1 3 1 0 06 logs directory This folder contains a historical record of all the MapReduce jobs or tasks executed in the cluster To create a complete chain of logs however you need to visit the userlogs folder of every data node in the cluster and aggregate the logs based on timestamp This is because the name node dynamically picks which data nodes to execute a specific task during a job s execution Figure 11 5 shows the userlogs directory of one of the data nodes after a few job executions in the cluster 196 CHAPTER 11 LOGGING IN HDINSIGHT Organize e Include in library Sharewith New folder sir Favorites Name Date modified Type ES Desktop Ji job_201312100246_0003 12 10 2013 3 53 AM Pie Folder Jp Downloads Ji job_201312100246_0006 12 10 2013 3 57 AM File Folder Recent Places job_201312100246_0011 12 10 2013 4 04 AM File folder J job_201312100246_0012 12 10 2013 4 04 AM File Folder a anes Ji job_201312100246_0013 12 10 2013 4 25 AM Pie Folder Documents D Ee 1 job_201312100246_0016 12 10 2013 4 24 AM File Folder Gi Pictures J job_201312100246_0018 12 10 2013 5 15AM File Folder H Videos A job_201312100246_0019 12 10 20135 16 4M File folder Ji job_201312100246_0021 12 10 2013 5 23 AM File
278. on the cross platform CLI tools have a look at the following http www windowsazure com en us manage install and configure cli 57 CHAPTER A AUTOMATING HDINSIGHT CLUSTER PROVISIONING Summary The Windows Azure HDInsight service exposes a set of NET based interfaces to control your clusters programmatically While NET languages like C are a popular choice with many skilled developers HDInsight also has a tight coupling with Windows Azure PowerShell and provides a set of useful cmdlets for cluster management PowerShell is a common choice of Windows administrators for creating a script based management infrastructure The combination of the NET SDK and PowerShell provide an automated way of implementing on demand cluster provisioning and job submission thus leveraging the full flexibility of Azure elastic services In addition to these NET APIs and PowerShell cmdlets there is also a multiplatform aware node js based command line interface that can be used for cluster management programmatically Because storage is isolated and retained in Azure blobs you no longer need to have your Hadoop clusters online and pay for computation hours In this chapter you saw how to use the NET APIs PowerShell and cross platform CLI commands for basic cluster management operations Currently the Hadoop NET SDK provides API access to aspects of HDInsight including HDFS HCatalog Oozie and Ambari There are also libraries for MapReduce and LINQ to
279. oopInstallFiles HadoopSetupTools hdp 1 0 1 winpkg install log HDP Logging to C HadoopInstallFiles HadoopSetupTools hdp 1 0 1 winpkg install log HDP HDP_INSTALL_PATH C HadoopInstallFiles HadoopPackages hdp 1 0 1 winpkg scripts HDP HDP_RESOURCES DIR C HadoopInstallFiles HadoopPackages hdp 1 0 1 winpkg resources HDP Extracting Java archive into c hadoop HDP C HadoopInstallFiles HadoopPackages hdp 1 0 1 winpkg resources winpkg ps1 C HadoopInstallFiles HadoopPackages hdp 1 0 1 winpkg resources java zip utils unzip c hadoop WINPKG Logging to existing log C HadoopInstallFiles HadoopSetupTools hdp 1 0 1 winpkg install log WINPKG ENV WINPKG_BIN is C HadoopInstallFiles HadoopPackages hdp 1 0 1 winpkg resources WINPKG Setting Environment CurrentDirectory to C HadoopInstallFiles HadoopPackages hdp 1 0 1 winpkg scripts WINPKG Current Directory C HadoopInstallFiles HadoopPackages hdp 1 0 1 winpkg scripts WINPKG Package C HadoopInstallFiles HadoopPackages hdp 1 0 1 winpkg resources java zip WINPKG Action utils WINPKG Action arguments unzip c hadoop WINPKG Run BuiltInAction C HadoopInstallFiles HadoopPackages hdp 1 0 1 winpkg resources java zip C HadoopInstallFiles HadoopPackages hdp 1 0 1 winpkg resources utils unzip c hadoop WINPKG Preparing to unzip C HadoopInstallFiles HadoopPackages hdp 1 0 1 winpkg resources java zip to c hadoop WINPKG Finished processing C HadoopInstallFiles HadoopPackages hdp
280. oot of your storage account container in Azure as shown in Figure 6 7 PUXP XP X debarchan supergroup H 2613 11 24 06 55 debarchan PUXP XP X hdpinternaluser supergroup H 2013 11 24 06 36 example Ywxr xEP x hadoopuser supergroup H 2013 12 66 15 11 hadoopuse PWXP XP X debarchan supergroup H 2013 11 24 06 54 hdp PWXP XE X hdp supergroup 2013 11 24 06 36 hive rwxr xP x hdp supergroup H 2613 11 24 06 35 mapred FM Mt admin supergroup H 2013 12 69 13 59 outputi ze hdp supergroup H 2013 11 24 07 05 templeton adoop PUXP XP X SYSTEM supergroup H 2013 11 24 06 35 user Figure 6 7 The 1s command output You can run the word count MapReduce job through the command prompt on the source file provided in the example data gutenburg directory in your WASB to generate the output file much like you did from the NET and PowerShell code in Chapter 5 The command to invoke the MapReduce job is provided in Listing 6 2 93 CHAPTER 6 EXPLORING THE HDINSIGHT NAME NODE Listing 6 2 Running the word count MapReduce job from the Hadoop command line hadoop jar hadoop examples jar wordcount example data gutenberg davinci txt example data commandlineoutput This launches the MapReduce job on the input file and you should see an output similar to Listing 6 3 Listing 6 3 MapReduce command line output 13 12 09 13 12 09 13 12 09 13 12 09 13 12 09 13 12 09 13 12 09 13 12 09 13 12 09 13 12 09 13 12 09 13
281. op and any traditional Relational Database Management System RDBMS It uses the MapReduce framework under the hood to perform import export operations and often it is a common choice for integrating data from relational and nonrelational data stores In this section we take a quick look at Sqoop operations that are compatible with Microsoft SQL Server on Azure Sqoop is based on Java Database Connectivity JDBC technology to establish connections to remote RDBMS servers Therefore you need the JDBC driver for SQL Server to be installed Table 6 1 summarizes a few of the key Sqoop operations that are supported with SQL Server databases in Azure Table 6 1 Sqoop commands Command Function sqoop import The import command lets you import SQL Server data into WABS You can opt to import an entire table using the table switch or selected records based on criteria using the query switch The data once imported to the Azure storage system is stored as delimited text files or as SequenceFiles for further processing You can also use the import command to move SQL Server data into Hive tables which are like logical schemas on top of WASB sqoop export You can use the export command to move data from WASB into SQL Server tables Much like the import command the export command lets you export data from delimited text files SequenceFiles and Hive tables into SQL Server The export command supports inserting new rows into the target SQL Ser
282. or data visualization tools HDInsight solutions are well suited to performing categorization and normalization of data and for extracting summary results to remove duplication and redundancy This is typically referred to as an Extract Transform and Load ETL process A basic data warehouse or commodity storage mechanism You can use HDInsight to store both the source data and the results of queries executed over this data You can also store schemas or to be precise metadata for tables that are populated by the queries you execute These tables can be indexed although there is no formal mechanism for managing key based relationships between them However you can create data repositories that are robust and reasonably low cost to maintain which is especially useful if you need to store and manage huge volumes of data An integration with an enterprise data warehouse and BI system Enterprise level data warehouses have some special characteristics that differentiate them from simple database systems so there are additional considerations for integrating with HDInsight You can also integrate at different levels depending on the way you intend to use the data obtained from HDInsight Figure 1 5 shows a sample HDInsight deployment as a data collection and analytics tool CHAPTER 1 INTRODUCING HDINSIGHT HDinsight External Data External Data Streaminsight Visualization and Reporting tools Output Reduce Denormalize and
283. orage account key from your Azure Management portal and make sure that you have the correct entry in the core site xml file Azure Throttling Windows Azure Blob Storage limits the bandwidth per storage account to maintain high storage availability for all customers Limiting bandwidth is done by rejecting requests to storage HTTP response 500 or 503 in proportion to recent requests that are above the allocated bandwidth To learn about such storage account limits refer to the following page http blogs msdn com b windowsazure archive 2012 11 02 windows azure s flat network storage and 2012 scalability targets aspx Your cluster will be throttled if or when your cluster is writing data to or reading data from WASB at rates greater than those stated earlier You can determine if you might hit those limits based on the size of your cluster and your workload type 239 CHAPTER 13 TROUBLESHOOTING JOB FAILURES Note Real Hadoop jobs have recurring task startup delays so the actual number of machines required to exceed the limit is generally higher than calculated Some initial indications that your job is being throttled by Windows Azure Storage may include the following e Longer than expected job completion times e A high number of task failures e Job failure Although these are indications that your cluster is being throttled the best way to understand if your workload is being throttled is by inspecting responses returned
284. orksDWH database 154 155 BI tools 147 client side server side component 147 connection string 150 151 decimal data type 154 DimDate table 156 drop down list 150 excel add ins 148 Import Wizard 149 manage icon 148 stock_analysis 147 152 153 156 stock_date 153 stock report see Stock report Powershell code management and readability 80 executing 83 84 execution policy 85 features 81 HDInsightCmdlets advantage 55 cluster provisioning 54 command function 55 command line interface CLI see Command Line Interface CLI hdinsightstorage 53 246 output 53 password compliance policy 54 powershell 51 specified module 52 zip file 52 ISE 82 job submission script 82 83 MapReduce job 80 81 MRRunner see MRRunner NET client 80 uses 85 Power view for excel features 161 insert ribbon 161 NASDAQ and NYSE 162 power BI see Power Business Intelligence stock comparison 162 Public static void ListClusters 45 R Relational database management systems RDBMS 3 S Server Integration Services SSIS 12 Service Trace Logs 187 190 SKEWED BY clause 130 Software development kit SDK see Hadoop NET SDK SQL Azure database creation CUSTOM CREATE option 27 Hive and Oozie data stores 26 MetaStore SQL Azure database 27 options 26 QUICK CREATE option 26 SQL Server Data Tools SSDT 168 SQL Server Integration Services SSIS columns mapping data flow 183 verification of 182 data fl
285. osoft HDInsight Hadoop on Windows HDInsight is Microsoft s implementation of a Big Data solution with Apache Hadoop at its core HDInsight is 100 percent compatible with Apache Hadoop and is built on open source components in conjunction with Hortonworks a company focused toward getting Hadoop adopted on the Windows platform Basically Microsoft has taken the open source Hadoop project added the functionalities needed to make it compatible with Windows because Hadoop is based on Linux and submitted the project back to the community All of the components are retested in typical scenarios to ensure that they work together correctly and that there are no versioning or compatibility issues I m a great fan of such integration because I can see the boost it might provide to the industry and I was excited with the news that the open source community has included Windows compatible Hadoop in their main project trunk Developments in HDInsight are regularly fed back to the community through Hortonworks so that they can maintain compatibility and contribute to the fantastic open source effort Microsoft s Hadoop based distribution brings the robustness manageability and simplicity of Windows to the Hadoop environment The focus is on hardening security through integration with Active Directory thus making it enterprise ready simplifying manageability through integration with System Center 2012 and dramatically reducing the time required to set up
286. ost 9010 grunt gt Let s execute a series of Pig statements to parse the Sample log file that is present in the example data folder by default in WASB containers The first statement loads the file content to a Pig variable called LOGS LOGS LOAD wasb example data sample log Then we will create a variable LEVELS that will categorize the entries in the LOGS variable based on Info Error Warnings and so forth For example LEVELS foreach LOGS generate REGEX _EXTRACT 0 TRACE DEBUG INFO WARN ERROR FATAL 1 as LOGLEVEL Next we can filter out the null entries in the FILTEREDLEVEL variables FILTEREDLEVELS FILTER LEVELS by LOGLEVEL is not null After that we can filter the group entries based on the values in the variable GROUPEDLEVELS GROUPEDLEVELS GROUP FILTEREDLEVELS by LOGLEVEL Next we count the number of occurrences of each entry type and load them in the FREQUENCIES variable For example FREQUENCIES foreach GROUPEDLEVELS generate group as LOGLEVEL COUNT FILTEREDLEVELS LOGLEVEL as COUNT Then we arrange the grouped entries in descending order of their number of occurrences in the RESULTS variable Here s how to sort in that order RESULT order FREQUENCIES by COUNT desc Finally we can print out the value of the RESULTS variable using the DUMP command Note that this is the place where the actual MapReduce job is triggered to process and fetch the data Here s the command DU
287. ou delete this file once it is imported successfully into PowerShell Coding the Application In your HadoopClient solution add a new class to your project and name it Constants cs There will be some constant values such as the subscriptionID certificate thumbprint user names passwords and so on Instead of writing them again and again we are going to club these values in this class and refer to them from our program Listing 4 1 shows the code in the Constants cs file 44 CHAPTER 4 AUTOMATING HDINSIGHT CLUSTER PROVISIONING Listing 4 1 The Constants cs File using System using System Collections Generic using System Ling using System Text namespace HadoopClient public class Constants public static Uri azureClusterUri new Uri https democluster azurehdinsight net 443 public static string thumbprint your_subscription thumbprint public static Guid subscriptionId new Guid your_ subscription id public static string clusterUser admin public static string hadoopUser hdp public static string clusterPassword your_password public static string storageAccount democluster blob core windows net public static string storageAccountKey your storage key public static string container democlustercontainer public static string wasbPath wasb democlustercontainer democluster blob core windows net When you choose your password make sure to meet the following passwo
288. ou just created as shown in Figure 3 10 29 CHAPTER 3 PROVISIONING YOUR HDINSIGHT SERVICE CLUSTER x NEW HDINSIGHT CLUSTER Configure Cluster User USER NAME hadoop PASSWORD CONFIRM PASSWORD 0000000888 CCC8 00000000800080 Ka Supply Hive Oozie Metastore HIVE META OOZIESTORE DATABASE MetaStore M DATABASE USER DATABASE USER PASSWORD sa ee s Ich 3 Figure 3 10 Configuring the cluster user and Hive metastore Note If you choose QUICK CREATE to create your cluster the default user name is Admin This can be changed only by using the CUSTOM CREATE wizard By default Hive Oozie uses an open source RDBMS system for its storage called Derby It can be embedded in a Java program like Hive and it supports online transaction processing If you wish to continue with Derby for your Hive and Oozie storage you can choose to leave the box deselected Choosing Your Storage Account The next step of the wizard is to select the storage account for the cluster You can use the already created democluster account to associate with the cluster You also get an option here to create a dedicated storage account on the fly or even to use a different storage account from a different subscription altogether This step also gives you the option of creating a default container in the storage account on the fly as shown in Figure 3 11 Be careful though because once as storage account for the cluster is chosen it cannot be changed If the stora
289. ou should see output similar to Listing 5 11 Listing 5 11 Hive job output Executing Hive Job Done List of Tables are aaplstockdata hivesampletable stock_analysis stock_analysis1 Note The hivesampletable is the only table that comes built in as a sample have other tables created so your output may be different based on the Hive tables you have The NET APIs provide the NET developers the flexibility to use their existing skills to automate job submissions in Hadoop This simple console application can be further enhanced to create a Windows Form application and provide a really robust monitoring and job submission interface for your HDInsight clusters Monitoring Job Status The NET SDK also supports the Hadoop supporting package Ambari Ambari is a framework that provides monitoring and instrumentation options for your cluster To implement the Ambari APIs you need to add the NuGet package Microsoft Hadoop WebClient You will also need to import the following namespaces in your Program cs file using Microsoft Hadoop WebClient AmbariClient using Microsoft Hadoop WebClient AmbariClient Contracts Once the references are added create a new function called MonitorCluster and add the code snippet as shown in Listing 5 12 Listing 5 12 MonitorCluster method public static void MonitorCluster var client new AmbariClient Constants azureClusterUri Constants clusterUser Constants clusterPasswor
290. output similar to the following:

    PM> install-package Microsoft.Hadoop.Hive
    Attempting to resolve dependency 'Newtonsoft.Json (= 2.4.5.11)'.
    Installing 'Microsoft.Hadoop.Hive 0.9.4951.25594'.
    Successfully installed 'Microsoft.Hadoop.Hive 0.9.4951.25594'.
    Adding 'Microsoft.Hadoop.Hive 0.9.4951.25594' to HadoopClient.
    Successfully added 'Microsoft.Hadoop.Hive 0.9.4951.25594' to HadoopClient.
    Setting MRLib items CopyToOutputDirectory=true.

Once the NuGet package has been added, add a reference to the dll file in your code:

    using Microsoft.Hadoop.Hive;

Once the references are added, you can develop the application code to construct and execute Hive queries against your HDInsight cluster.

Creating the Hive Queries

The Hive .NET API exposes a few key methods to create and run Hive jobs. The steps are pretty similar to creating a MapReduce job submission. Add a new DoHiveOperations method in your Program.cs file. This method will contain your Hive job submission code. As with your MapReduce job submission code, the first step is to create your Hive job definition:

    HiveJobCreateParameters hiveJobDefinition = new HiveJobCreateParameters()
    {
        JobName = "Show tables job",
        StatusFolder = "/TableListFolder",
        Query = "show tables;"
    };

Next is the regular piece of code dealing with certificates and credentials to submit and run jobs in the cluster:

    var store = new X509Store();
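To round out the picture, here is a minimal sketch of how such a definition is typically submitted with this SDK. The credential type and method names (CreateHiveJob in particular) should be verified against the SDK version you have installed, and WaitForJobCompletion is the helper shown later in this book:

    // A sketch, following the certificate/credential pattern used throughout this chapter
    var store = new X509Store();
    store.Open(OpenFlags.ReadOnly);
    var cert = store.Certificates.Cast<X509Certificate2>()
        .First(item => item.Thumbprint == Constants.thumbprint);
    var creds = new JobSubmissionCertificateCredential(Constants.subscriptionId, cert, Constants.container);

    // Connect a job submission client and fire off the Hive job definition
    var jobClient = JobSubmissionClientFactory.Connect(creds);
    JobCreationResults jobResults = jobClient.CreateHiveJob(hiveJobDefinition);

    // Block until the job reaches a terminal state
    WaitForJobCompletion(jobResults, jobClient);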
292. ows PowerShell.

Write-Output: The Write-Output cmdlet sends the specified object down the pipeline to the next command. If the command is the last command in the pipeline, the object is displayed in the console.

Write-Progress: The Write-Progress cmdlet displays a progress bar in a Windows PowerShell command window that depicts the status of a running command or script. You can select the indicators that the bar reflects and the text that appears above and below the progress bar.

Write-Verbose: The Write-Verbose cmdlet writes text to the verbose message stream in Windows PowerShell. Typically, the verbose message stream is used to deliver information about command processing that is used for debugging a command.

Write-Warning: The Write-Warning cmdlet writes a warning message to the Windows PowerShell host. The response to the warning depends on the value of the user's $WarningPreference variable and the use of the -WarningAction common parameter.

Using the -debug Switch

Another option in PowerShell is to use the -debug switch while executing your scripts. This switch prints the status messages in the PowerShell command prompt during script execution and can help you debug your script failures. A sample output using the -debug switch while trying to get cluster details with an incorrect subscription name is similar to the one shown in Listing 12-4.

Listing 12-4. The -debug switch

    Get
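As a small, hypothetical illustration of these cmdlets working together in a deployment script (the cluster name, messages, and loop are invented for the example):

    # Sketch: instrumenting a provisioning script with the Write-* cmdlets
    $clusterName = "democluster"   # hypothetical value for illustration

    Write-Output "Starting provisioning checks for $clusterName"
    Write-Verbose "Validating subscription and storage account settings" -Verbose

    for ($i = 1; $i -le 5; $i++) {
        Write-Progress -Activity "Provisioning $clusterName" -Status "Step $i of 5" -PercentComplete ($i * 20)
        Start-Sleep -Seconds 1
    }

    Write-Warning "Provisioning can take several minutes; do not close this window."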
293. plete: job_201311240635_0193

    13/12/09 19:47:42 INFO mapred.JobClient: Counters: 30
    13/12/09 19:47:42 INFO mapred.JobClient:   Job Counters
    13/12/09 19:47:42 INFO mapred.JobClient:     Launched reduce tasks=1
    13/12/09 19:47:42 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=8500
    13/12/09 19:47:42 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
    13/12/09 19:47:42 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
    [further mapred.JobClient counter lines from this listing did not survive extraction]
294. port the NuGet packages for HDInsight in your Visual Studio application.

Adding the NuGet Packages

To use the HDInsight NuGet packages, you need to create a solution first. Since we are going to perform the cluster management operations that we can see from the Azure portal, a console application is good enough to demonstrate the functionality. Launch Visual Studio 2013 and choose to create a new C# Console Application from the list of available project types, as shown in Figure 4-1.

[Figure 4-1 shows the Visual Studio New Project dialog with the Visual C# Console Application template selected ("A project for creating a command-line application"), the name and solution name set to HadoopClient, and "Create directory for solution" checked.]

Figure 4-1. New C# console application

Once the solution is created, open the NuGet Package Manager Console to import the req
295. pred.JobInitializationPoller: Initializing job job_201311240635_0001 in Queue joblauncher for user admin

    2013-11-24 07:05:24,324 INFO org.apache.hadoop.mapred.JobTracker: Initializing job_201311240635_0001
    2013-11-24 07:05:24,325 INFO org.apache.hadoop.mapred.JobInProgress: Initializing job_201311240635_0001
    2013-11-24 07:05:24,576 INFO org.apache.hadoop.mapred.JobInProgress: Input size for job job_201311240635_0001 = 0. Number of splits = 1
    2013-11-24 07:05:24,577 INFO org.apache.hadoop.mapred.JobInProgress: job_201311240635_0001 LOCALITY_WAIT_FACTOR=0.0
    2013-11-24 07:05:24,578 INFO org.apache.hadoop.mapred.JobInProgress: Job job_201311240635_0001 initialized successfully with 1 map tasks and 0 reduce tasks.
    2013-11-24 07:05:24,659 INFO org.apache.hadoop.mapred.JobTracker: Adding task (JOB_SETUP) 'attempt_201311240635_0001_m_000002_0' to tip task_201311240635_0001_m_000002, for tracker 'tracker_workernode1:127.0.0.1/127.0.0.1:49193'
    2013-11-24 07:05:28,224 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201311240635_0001_m_000002_0' has completed task_201311240635_0001_m_000002 successfully.

The highlighted sections of the preceding log give you the key settings configured to execute this job. Because the jobtracker trace log file records the command, you can easily figure out which of the parameters are overridden in the command line and which are the ones being inherited from the configuration files, and then take appropri
297. ptionName = "Your Subscription Name"
    $storageAccountName = "democluster"
    $containerName = "democlustercontainer"

    # This path may vary depending on where you place the source csv files
    $fileName = "C:\Numbers.txt"
    $blobName = "example/data/Numbers.txt"

    # Get the storage account key
    Select-AzureSubscription $subscriptionName
    $storageaccountkey = Get-AzureStorageKey $storageAccountName | %{ $_.Primary }

    # Create the storage context object
    $destContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageaccountkey

    # Copy the file from local workstation to the Blob container
    Set-AzureStorageBlobContent -File $fileName -Container $containerName -Blob $blobName -Context $destContext

On successful execution, you should see output similar to the following:

    Container Uri: https://democluster.blob.core.windows.net/democlustercontainer

    Name       BlobType   Length   ContentType    LastModified
    example/d  BlockBlob  23       applicatio...  12/9/2013

You can also verify that the file exists in your blob container through the Azure Management portal, as shown in Figure 5-3.

[Figure 5-3 shows the democlustercontainer listing in the portal, with blobs such as example/data/Numbers.txt and SampleTableQueryFolder, each with its https://democluster.blob.core.windows.net/democlustercontainer/... URL.]
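If you prefer to verify the upload from PowerShell rather than the portal, a quick check along these lines should work, reusing the same context object created above:

    # Sketch: list the blobs in the container to confirm the upload
    Get-AzureStorageBlob -Container $containerName -Context $destContext |
        Select-Object Name, Length, LastModified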
299. r the FILTEREDLEVELS object, you can now issue the following command:

    EXPLAIN FILTEREDLEVELS;

This command should produce output similar to that in Listing 13-17.

Listing 13-17. The Explain command

    grunt> EXPLAIN FILTEREDLEVELS;
    2013-11-22 18:30:55,721 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s).
    2013-11-22 18:30:55,723 [main] WARN org.apache.pig.PigServer - Encountered Warning USING_OVERLOADED_FUNCTION 1 time(s).
    FILTEREDLEVELS: (Name: LOStore Schema: LOGLEVEL#78:chararray)
    |---FILTEREDLEVELS: (Name: LOFilter Schema: LOGLEVEL#78:chararray)
        |   (Name: Not Type: boolean Uid: 80)
        |   (Name: IsNull Type: boolean Uid: 79)
        |   LOGLEVEL: (Name: Project Type: chararray Uid: 78 Input: 0 Column: 0)
        |---LEVELS: (Name: LOForEach Schema: LOGLEVEL#78:chararray)
            |   (Name: LOGenerate[false] Schema: LOGLEVEL#78:chararray)
            |       (Name: UserFunc(org.apache.pig.builtin.REGEX_EXTRACT) Type: chararray Uid: 78)
            |       (Name: Cast Type: chararray Uid: 74)
            |           (Name: Project Type: bytearray Uid: 74 Input: 0 Column: (*))
            |       (Name: Constant Type: chararray Uid: 76)
            |       (Name: Constant Type: int Uid: 77)
            |   (Name: LOInnerLoad[0] Schema: #74:bytearray)
            |---LOGS: (Name: LOLoad Schema: null) RequiredFields: null
    FILTEREDLEVELS: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-57
    |---FILTEREDLEVELS: Filter[bag] - scope-53
        |   Not[boolean] - scope-56
        |   POIsNull[boolean] - scope-55
300. racing successfully initialized.

    DateTime=2013-11-24T06:35:12.0190000Z Timestamp=3610300511 HadoopServiceTraceSource Information: 0 : Loading service xml: c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\bin\jobtracker.xml
    DateTime=2013-11-24T06:35:12.0190000Z Timestamp=3610344009 HadoopServiceTraceSource Information: 0 : Successfully parsed service xml for service jobtracker
    DateTime=2013-11-24T06:35:12.0190000Z Timestamp=3610353933 HadoopServiceTraceSource Information: 0 : Command line: c:\apps\dist\java\bin\java -server -Xmx4096m -Dhadoop.log.dir=c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\logs -Dhadoop.log.file=hadoop-jobtracker-RD00155D67172B.log -Dhadoop.home.dir=c:\apps\dist\hadoop-1.2.0.1.3.1.0-06 -Dhadoop.root.logger=INFO,console,DRFA,ETW,FilterLog -Djava.library.path=c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\lib\native\Windows_NT-amd64-64;c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\lib\native -Dhadoop.policy.file=hadoop-policy.xml -Dcom.sun.management.jmxremote -Detwlogger.component=jobtracker -Dwhitelist.filename=core-whitelist.res -classpath c:\apps\dist\hadoop-1.2.0.1.3.1.0-6\conf;c:\apps\dist\java\lib\tools.jar;c:\apps\dist\hadoop-1.2.0.1.3.1.0-06;c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\hadoop-ant-1.2.0.1.3.1.0-06.jar;c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\hadoop-client-1.2.0.1.3.1.0-06.jar;c:\apps\dist\hadoop-1.2.0.1.3.1.0-06\hadoop-core-1.2.0.1.3.1.0-06.jar;c:\apps\dist\hadoop-1.2.0.1.3.1.0
301. raditional database systems, such as SQL Server, is that Hive adopts a schema on read approach. This approach enables you to be flexible about the specific columns and data types that you want to project on top of your data. You can create multiple tables with different schemas from the same underlying data, depending on how you want to use that data. The most important point to take away from this approach is that the table is simply a metadata schema that is imposed on data in underlying files.

Creating Hive Tables

You create tables by using the HiveQL CREATE TABLE statement, which in its simplest form looks similar to the analogous statement in Transact-SQL. One thing to note about Hive tables is that you can create two types of tables: External and Internal. If you do not specify a table type, a table is created as Internal.

Be careful: an internal table tells Hive to manage the data by itself. If you drop the table, by default the data is also dropped and cannot be recovered. If you want to manage the data and data locations yourself, if your data is used outside Hive, or if you need to retain the data, create an external table. The syntax is pretty much similar, requiring just the addition of the EXTERNAL keyword.

You can use the PARTITIONED BY clause to create a subfolder for each distinct value in a specified column (for example, to store a file of daily data for each date in a separate folder). Partitioning can improve query performance because HDInsigh
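For illustration, a minimal external, partitioned table in this spirit might look as follows; the table name, columns, and WASB path are hypothetical and should be adjusted to your own data:

    -- Hypothetical external, partitioned stock table; adjust columns and location to your data
    CREATE EXTERNAL TABLE stockdata (
        stock_symbol STRING,
        open_price   DOUBLE,
        close_price  DOUBLE
    )
    PARTITIONED BY (trade_date STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/stockdata';

    -- Dropping an external table removes only the metadata; the files under /stockdata remain
    -- DROP TABLE stockdata;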
302. rd requirements to avoid getting an error when you execute your program:

• The field must contain at least 10 characters.
• The field cannot contain the user name.
• The field must contain one each of the following: an uppercase letter, a lowercase letter, a number, and a special character.

Next, navigate to the Program.cs file in the solution that has the Main function, the entry point of a console application. You need to add the required references to access the certificate store for the Azure certificate, as well as the different HDInsight management operations. Go ahead and add the following using statements at the top of your Program.cs file:

    using System.Security.Cryptography.X509Certificates;
    using Microsoft.WindowsAzure.Management.HDInsight;

Create a new public function called ListClusters. This function will have the code to query the certificate store and list the existing HDInsight clusters under that subscription. Listing 4-2 outlines the code for the ListClusters function.

Listing 4-2. Enumerating Clusters in Your Subscription

    public static void ListClusters()
    {
        var store = new X509Store();
        store.Open(OpenFlags.ReadOnly);
        var cert = store.Certificates.Cast<X509Certificate2>()
            .First(item => item.Thumbprint == Constants.thumbprint);
        var creds = new HDInsightCertificateCredential(Constants.subscriptionId, cert);
        var client = HD
303. red, submit jobs for processing, and shut the cluster down once the output is written back into Azure blob storage. This process is possible because the storage used for input and output is Azure blob storage; as such, the cluster is needed only for compute operations and not storage. During the creation, the cluster name and number of hosts can be specified, and during the job submission, the input and output paths can be specified as well. One could, of course, customize these scripts to include additional parameters, such as the number of mappers, additional job arguments, and so on.

Note: Microsoft consultant Carl Nolan has a wonderful blog about using PowerShell cmdlets to provide a mechanism for managing an elastic service. You can read his blog at http://blogs.msdn.com/b/carlnol/archive/2013/06/07/managing-your-hdinsight-cluster-with-powershell.aspx.

Command Line Interface (CLI)

The command line is an open source, cross-platform interface for managing HDInsight clusters. It is implemented in Node.js; thus, it is usable from multiple platforms, such as Windows, Mac, Linux, and so on. The source code is available at the GitHub web site: https://github.com/WindowsAzure/azure-sdk-tools-xplat. The sequence of operations in the CLI is pretty much the same as in PowerShell: you have to download and import the Azure publishsettings file as a persistent local config setting that the command-line interface will use for its subsequent operations.
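By way of a hedged example, the flow with the cross-platform CLI of that generation looked roughly like this; the file path is a placeholder, and the exact subcommand syntax should be checked against your CLI version with azure hdinsight --help:

    # Import the downloaded publishsettings file once
    azure account import ~/Downloads/mysubscription.publishsettings

    # List existing HDInsight clusters, then inspect one
    azure hdinsight cluster list
    azure hdinsight cluster show democluster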
304. redentialFilePath c:\hadoop\singlenodecreds.xml

    HADOOP: Stopping MapRed services if already running before proceeding with install
    HADOOP: Stopping mapreduce jobtracker/tasktracker/historyserver services
    HADOOP: Stopping jobtracker service
    HADOOP: Stopping tasktracker service
    HADOOP: Stopping historyserver service
    HADOOP: Stopping HDFS services if already running before proceeding with install
    HADOOP: Stopping hdfs namenode/datanode/secondarynamenode services
    HADOOP: Stopping namenode service
    HADOOP: Stopping datanode service
    HADOOP: Stopping secondarynamenode service
    HADOOP: Logging to existing log C:\HadoopInstallFiles\HadoopSetupTools\hdp-1.0.1.winpkg.install.log

The installer log is a great place to review how each of the operations is set up and executed. Even if there are no errors during deployment, you should refer to this log for a detailed understanding of the sequence of operations during the installation. The log shown here is stripped down for brevity; it contains the messages for each of the projects that get deployed. I have stopped at Hadoop here; in your installer log, you would see the verbose messages for Hive, Pig, Sqoop, and the rest of the projects. If there is a component missing after the installation, such as Hive, you can investigate the install log file, scroll down to the section for the respective component, and track down the cause of the error.
305. river.execute>
    2013-11-15 14:25:31,816 INFO ql.Driver (Driver.java:execute(1066)) - Starting command: drop database hive
    2013-11-15 14:25:31,816 INFO ql.Driver (PerfLogger.java:PerfLogEnd(127)) - </PERFLOG method=TimeToSubmit start=1384525531811 end=1384525531816 duration=5>
    2013-11-15 14:25:31,846 ERROR exec.Task (SessionState.java:printError(432)) - There is no database named hive
    NoSuchObjectException(message:There is no database named hive)
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_database_result$get_database_resultStandardScheme.read(ThriftHiveMetastore.java:9883)

There could be errors while executing DML commands, like SELECT, against your Hive tables. To understand and troubleshoot such errors, you need to know the different phases that an HQL query goes through. Table 13-1 summarizes the phases of Hive query execution.

Table 13-1. Hive query execution phases

    Phase                     Description
    Parsing                   Converts a query into a parse tree. If there are syntax errors in your query (for example, a missing semicolon at the end), it is likely to fail at this stage.
    Semantic Analysis         Builds a logical plan based on the information retrieved from the Hive metastore database. Metadata failure errors, where the underlying schema has changed after the query is submitted, are reported in this phase.
    Physical Plan Generation  Converts the logical plan to a physical plan that generates a directed acyclic graph of MapReduce
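If you want to see where a statement lands in these phases without running it, Hive's EXPLAIN statement prints the parse tree and stage plan instead of executing the job. The table name below is hypothetical:

    -- Prints the abstract syntax tree and the stage plan instead of executing
    EXPLAIN SELECT stock_symbol, COUNT(*) FROM stockdata GROUP BY stock_symbol;

    -- EXPLAIN EXTENDED adds file paths and more operator detail
    EXPLAIN EXTENDED SELECT stock_symbol, COUNT(*) FROM stockdata GROUP BY stock_symbol;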
306. [Figure 9-23 shows the Microsoft Power Query for Excel import menu, with entries such as "From Windows Azure Marketplace: Import data from the Microsoft Windows Azure Marketplace," "From Hadoop File (HDFS): Import data from a Hadoop Distributed File System," "From Active Directory: Import data from Microsoft Active Directory," "From Windows Azure HDInsight," and "From Facebook: Import data from Facebook."]

Figure 9-23. Connecting Power Query to Azure HDInsight

Excel will prompt you for the cluster storage account name, as shown in Figure 9-24. Provide your storage account name and click OK.

[Figure 9-24 shows the Microsoft Windows Azure HDInsight prompt, "Enter the name of the Azure Blob Storage account associated with your HDInsight cluster," with the Account Name set to democluster.]

Figure 9-24. Provide a cluster storage account name

When you connect for the first time, Excel will also prompt you for your storage account key. Enter that key and click on Save. Click on Edit Query to load the Query Editor screen, where you can specify your filter criteria. Expand the drop-down list under the Name column and filter only the rows that have csv files, as shown in Figure 9-25.

[Figure 9-25 shows the Query Editor over the downloaded blob list, with Name, Extension, Date accessed, Date modified, Date created, and Folder Path columns, and the filter drop-down open on the Name column.]
307. rticular stores and determine the timing of price markdowns.

• Questions that require advanced analytics. An example of this type is a credit card system that uses machine learning to build better fraud-detection algorithms. The goal is to go beyond the simple business rules involving charge frequency and location to also include an individual's customized buying patterns, ultimately leading to a better experience for the customer.

Organizations that take advantage of Big Data to ask and answer these questions will more effectively derive new value for the business, whether it is in the form of revenue growth, cost savings, or entirely new business models. One of the most obvious questions that then comes up is this: What is the shape of Big Data? Big Data typically consists of delimited attributes in files (for example, comma-separated value, or CSV, format), or it might contain long text tweets, Extensible Markup Language (XML), JavaScript Object Notation (JSON), and other forms of content from which you want only a few attributes at any given time. These new requirements challenge traditional data management technologies and call for a new approach to enable organizations to effectively manage data, enrich data, and gain insights from it. Through the rest of this book, we will talk about how Microsoft offers an end-to-end platform for all data, and the easiest-to-use tools to analyze it. Microsoft's data platform
308. s, you may get an exception, as shown next, when you try to run PowerShell scripts that use dll files that are compiled and signed externally:

    PS C:\> .\SubmitJob.ps1
    .\SubmitJob.ps1 : File C:\SubmitJob.ps1 cannot be loaded because running scripts is disabled on
    this system. For more information, see about_Execution_Policies at
    http://go.microsoft.com/fwlink/?LinkID=135170.
    At line:1 char:1
    + .\SubmitJob.ps1

If you encounter such a problem, you need to explicitly set the PowerShell execution policy using the following command:

    Set-ExecutionPolicy RemoteSigned

While setting the execution policy, accept any warnings you might get in the PowerShell console. It is also possible to submit Hive jobs using PowerShell, much like the .NET SDK. Carl Nolan has a great blog that covers Hive job submission through PowerShell: http://blogs.msdn.com/b/carlnol/archive/2013/06/18/managing-hive-job-submissions-with-powershell.aspx.

Using MRRunner

To submit MapReduce jobs, the HDInsight distribution offers a command-line utility called MRRunner, which can be used as well, apart from the .NET SDK and the HDInsight PowerShell cmdlets. To support the MRRunner utility, you should have an assembly (a .NET dll) that defines at least one implementation of HadoopJob<>. If the dll contains only one implementation of HadoopJob<>, like our HadoopClient.dll does, you can run the job with the following:

    MRRunner -dll MyDll

If the d
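For reference, a hedged sketch of the PowerShell route Nolan describes, using the HDInsight cmdlets of that generation, looks like this; the cluster name and query are placeholders, and the cmdlet names should be verified against your installed module:

    # Sketch: define and submit a Hive job with the Azure HDInsight cmdlets
    $clusterName = "democluster"   # placeholder

    $hiveJob = New-AzureHDInsightHiveJobDefinition -JobName "Show tables job" -Query "show tables;"
    $runningJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $hiveJob
    Wait-AzureHDInsightJob -Job $runningJob -WaitTimeoutInSeconds 600

    # Fetch the job's standard output
    Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $runningJob.JobId -StandardOutput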
309. s Data Platform, Microsoft HDInsight on Azure, and component bundles like Hive, Pig, HCatalog, Sqoop, and Oozie are run. Running: The cluster is ready for use.

A few scenarios, apart from the ones shown in the preceding table, can lead to failure during the cluster provisioning process:

• A race condition exists on cluster creation. An operation to create the hidden Cloud Service object was not synchronous; a subsequent call to retrieve the Cloud Service to use in the next step failed.
• VM cores are limited by subscription. Attempts to create a cluster using cores past the subscription limit failed.
• Datacenter capacity is limited. Because HDInsight clusters can use a large number of cores, cluster creation failures can occur when the datacenter is near capacity.
• Certain operations have must-succeed logging attached to them. If the underlying logging infrastructure (Windows Azure Tables) is not available or times out, the cluster creation effort may fail.

Installer Logs

The Windows Azure HDInsight Service has a mechanism to log its cluster deployment operations. Log files are placed in the C:\HDInsightLogs directory on the name node and data nodes. They contain two types of log files:

• AzureInstallHelper.log
• DeploymentAgent.log

These files give you information about several key aspects of the deployment process. Basically, after the VMs are provisioned, a deployment service runs for HDInsight that unpacks and installs Hadoop and
310. s at the map level.

• From the command line, it will report that a map join is being done, because it is pushing a smaller table up to memory.
• And, right at the end, there is a call-out that it is converting the join into a MapJoin.

The command-line output or the Hive logs will have snippets indicating that a map join has happened, as you can see in Listing 13-14.

Listing 13-14. hive.log file

    2013-11-26 10:55:41 Starting to launch local task to process map join; maximum memory = 932118528
    2013-11-26 10:55:45 Processing rows: 200000 Hashtable size: 199999 Memory usage: 145227488 rate: 0.158
    2013-11-26 10:55:47 Processing rows: 300000 Hashtable size: 299999 Memory usage: 183032536 rate: 0.188
    2013-11-26 10:55:49 Processing rows: 330936 Hashtable size: 330936 Memory usage: 149795152 rate: 0.166
    2013-11-26 10:55:49 Dump the hashtable into file: file:/tmp/msgbigdata/hive_2013-11-26_22-55-34_959_3143934780177488621/-local-10002/HashTable-Stage-4/MapJoin-mapfile01--.hashtable
    2013-11-26 10:55:56 Upload 1 File to: file:/tmp/msgbigdata/hive_2013-11-26_22-55-34_959_3143934780177488621/-local-10002/HashTable-Stage-4/MapJoin-mapfile01--.hashtable File size: 39685647
    2013-11-26 10:55:56 End of local task; Time Taken: 13.203 sec.
    Execution completed successfully
    Mapred Local Task Succeeded. Convert the Join into MapJoin
    Launching Job
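Whether Hive attempts this conversion is governed by a couple of session settings you can adjust from the Hive command line; the threshold value below is illustrative, not prescriptive:

    -- Let Hive convert qualifying joins to map joins automatically
    SET hive.auto.convert.join=true;

    -- Tables smaller than this many bytes are considered small enough to broadcast
    SET hive.mapjoin.smalltable.filesize=25000000;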
311. s consumed in today's world, and it has always been like this. For example, today a standard international flight generates around 5 terabytes of operational data. That is during a single flight. Big Data solutions were already implemented long ago, back when the Google, Yahoo, and Bing search engines were developed, but these solutions were limited to large enterprises because of the hardware cost of supporting them. This is no longer an issue, because hardware and storage costs are dropping drastically like never before. New types of questions are being asked, and data solutions are used to answer these questions and drive businesses more successfully. These questions fall into the following categories:

• Questions regarding social and Web analytics. Examples of these types of questions include the following: What is the sentiment toward our brand and products? How effective are our advertisements and online campaigns? Which gender, age group, and other demographics are we trying to reach? How can we optimize our message, broaden our customer base, or target the correct audience?
• Questions that require connecting to live data feeds. Examples of this include the following: a large shipping company that uses live weather feeds and traffic patterns to fine-tune its ship and truck routes to improve delivery times and generate cost savings; and retailers that analyze sales, pricing, economic, demographic, and live weather data to tailor product selections at pa
312. s\dist\hadoop-1.2.0.1.3.0.1-0302\logs\history\job_201311120315_0003_conf.xml to wasb://democlustercontainer@democluster.blob.core.windows.net/mapred/history/done/version-1/jobtrackerhost_1384226104721_/2013/11/16/000000

The JobTracker log files are pretty verbose. If you go through them carefully, you should be able to track down and resolve any errors in your Hive data processing jobs. Troubleshooting can be tricky, however, if the problem is with job performance. If your Hive queries are joining multiple tables and their different partitions, the query response times can be quite long. In some cases, they will need manual tuning for optimum throughput. To that end, the following subsections provide some best practices leading toward better execution performance.

Compress Intermediate Files

A large volume of intermediate files is generated during the execution of MapReduce jobs. Analysis has shown that if these intermediate files are compressed, job execution performance tends to be better. You can execute the following SET commands to set compression parameters from the Hadoop command line:

    set mapred.compress.map.output=true;
    set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
    set hive.exec.compress.intermediate=true;
313. s for Hadoop services. At a glance, this portal gives you an overview of how your cluster is doing, as shown in Figure 6-13.

[Figure 6-13 shows the NameNode status page (namenodehost:9000): started Tue Dec 10 02:46:59 GMT 2013, version 1.2.0.1.3.1.0-06, with links to browse the filesystem and the NameNode logs, and a Cluster Summary of 34 files and directories, 481 blocks, configured capacity 1.95 TB, DFS used 63.69 MB (0%), non-DFS used 17.77 GB, DFS remaining 1.94 TB (99.11%), 2 live nodes, 0 dead nodes, 0 decommissioning nodes, 468 under-replicated blocks, and the NameNode storage directory c:\hdfs\nn (IMAGE_AND_EDITS, Active).]

Figure 6-13. The Name Node Status Portal

You can drill down on the data nodes, access their file system, and go all the way to the job configurations used during job submission, as shown in Figure 6-14.

[Figure 6-14 shows the contents of the directory /mapred/userhistory/_logs/history, including the job_201311240635_0001_conf.xml file (53.68 KB, replication 3, block size 256 MB, owned by admin) with its modification time and permissions.]
314. s.storageAccountKey,
        DefaultStorageContainer = Constants.container,
        UserName = Constants.clusterUser,
        Password = Constants.clusterPassword,
        ClusterSizeInNodes = 2
    };

    var clusterDetails = client.CreateCluster(clusterInfo);
    ListClusters();

When you execute this method by similarly adding a call in Main, you will see that a new cluster deployment has started in the Windows Azure Management Portal, as shown in Figure 4-7.

[Figure 4-7 shows the portal cluster list with AutomatedHDICluster being provisioned alongside the running datadork and democluster clusters.]

Figure 4-7. New cluster provisioning

Once the virtual machines are configured and the cluster creation is complete, you will see the cluster URL in your console application output. For example:

    Created cluster: https://AutomatedHDICluster.azurehdinsight.net

You can call the ShowClusters method again, and this time it will display three HDInsight clusters, along with the new one just deployed:

    Cluster: AutomatedHDICluster, Nodes: 2
    Cluster: datadork, Nodes: 4
    Cluster: democluster, Nodes: 4

You can also drop a cluster using the DeleteCluster method of the .NET SDK. The code snippet in Listing 4-4 shows how to call the DeleteCluster function.

Listing 4-4. The DeleteCluster Method

    public static void DeleteCluster()
    {
        var store = new X509Store();
        store.Open(OpenFlags.ReadOnly);
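Following the same certificate-and-credential pattern as ListClusters, a plausible completion of the DeleteCluster body looks like this; it is a sketch, and the cluster name argument is simply the cluster created earlier:

        // Sketch of the remainder, mirroring the ListClusters pattern above
        var cert = store.Certificates.Cast<X509Certificate2>()
            .First(item => item.Thumbprint == Constants.thumbprint);
        var creds = new HDInsightCertificateCredential(Constants.subscriptionId, cert);
        var client = HDInsightClient.Connect(creds);

        // Delete by name; "AutomatedHDICluster" is the cluster created earlier
        client.DeleteCluster("AutomatedHDICluster");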
315. seamlessly manages any data (relational, nonrelational, and streaming) of any size (gigabytes, terabytes, or petabytes) anywhere (on premises and in the cloud); it enriches existing data sets by connecting to the world's data; and it enables all users to gain insights with familiar and easy-to-use tools through Office, SQL Server, and SharePoint.

How Is Big Data Different?

Before proceeding, you need to understand the difference between traditional relational database management systems (RDBMS) and Big Data solutions, particularly how they work and what result is expected. Modern relational databases are highly optimized for fast and efficient query processing, using different techniques; generating reports using Structured Query Language (SQL) is one of the most commonly used techniques. Big Data solutions are optimized for reliable storage of vast quantities of data; the often unstructured nature of the data, the lack of predefined schemas, and the distributed nature of the storage usually preclude any optimization for query performance. Unlike SQL queries, which can use indexes and other intelligent optimization techniques to maximize query performance, Big Data queries typically require an operation similar to a full table scan. Big Data queries are batch operations that are expected to take some time to execute. You can perform real-time queries in Big Data systems, but typically you will run a query and store the results for use within your existing
316. see the following MSDN article: http://msdn.microsoft.com/en-us/library/5557y8b4(v=vs.90).aspx.

Note: Breakpoints are active only when using the Visual Studio debugger. When executing a program that has been compiled in release mode, or when the debugger is not active, breakpoints are unavailable.

Using IntelliTrace

IntelliTrace is a feature introduced in Visual Studio 2010 Ultimate that makes the life of a developer much easier when it comes to debugging. Visual Studio collects data about an application while it's executing to help developers diagnose errors. The collected data is referred to as IntelliTrace events. These events are collected as part of the default debugging experience, and, among other things, they let developers step back in time to see what happened in an application without having to restart the debugger. IntelliTrace is particularly useful when a developer needs a deeper understanding of code execution, because it provides a way to collect the complete execution history of an application. Enable IntelliTrace for your application from the Debug > IntelliTrace > Open IntelliTrace Settings menu, as shown in Figure 12-4.

[Figure 12-4 shows the Visual Studio Options dialog with "Enable IntelliTrace" checked and the collection level set to "IntelliTrace events only."]
317. .................................................. 183
Summary ............................................... 185

Chapter 11: Logging in HDInsight ...................... 187
    Service Logs ...................................... 187
    Service Trace Logs ................................ 187
    Service Wrapper Files ............................. 190
    Service .err Files ................................ 190
    Hadoop log4j Log Files ............................ 191
    .NET Framework Logs ............................... 194
    Windows ODBC Tracing .............................. 198
    Logging Windows Azure Storage Blob Operations ..... 201
    Logging in Windows Azure HDInsight Emulator ....... 203
    Summary ........................................... 204

Chapter 12: Troubleshooting Cluster Deployments ....... 205
    Cluster Creation .................................. 205
    The Cluster Provisioning Process .................. 206
    Troubleshooting Visual Studio Deployments ......... 211
        Using Breakpoints ............................. 211
        Using IntelliTrace ............................ 212
    Troubleshooting PowerShell Deployments ............ 216
        Using the Write Cmdlets ....................... 216
        The -debug Switch ............................. 217
    Summary ........................................... 217

Chapter 13: Troubleshooting Job Failures ..............
318. sing familiar tools such as Excel and a SQL-like language, without having to write complex MapReduce jobs. Hive queries are broken down into MapReduce jobs under the hood, and those jobs remain a complete abstraction to the user. The simplicity and SQL-ness of Hive queries has made Hive a popular and preferred choice for users. That is particularly so for users with traditional SQL skills, because the ramp-up time is so much less than what is required to learn how to program MapReduce jobs directly. Figure 8-2 gives an overview of the Hive architecture.

[Figure 8-2 shows the Hive architecture: the Hive Query Language (HQL) surface; ODBC/JDBC access through the Thrift Server; the Hive Web Interface (HWI) and Command Line Interface (CLI); and the MetaStore, Compiler, Optimizer, and Executor components.]

Figure 8-2. Hive architecture

In effect, Hive enables you to create an interface layer over MapReduce that can be used in a similar fashion to a traditional relational database. This layer enables business users to use familiar tools, like Excel and SQL Server Reporting Services, to consume data from HDInsight as they would from a database system such as SQL Server, remotely, through an ODBC connection. The rest of this chapter walks you through different Hive operations and using the Hive ODBC driver to consume the data.

Working with Hive

Hive uses tables to impose schema on data and provides a query interface for client applications. The key difference between Hive tables and those in t
319. Hadoop Distributed File System (HDFS) ............ 89
    Hadoop Command Line ............................... 92
    The Hive Console .................................. 96
    The Sqoop Console ................................. 97
    The Pig Console ................................... 101
    Hadoop Web Interfaces ............................. 104
        Hadoop MapReduce Status ....................... 104
        The Name Node Status Portal ................... 106
        The TaskTracker Portal ........................ 107
    HDInsight Windows Services ........................ 108
    Installation Directory ............................ 110
    Summary ........................................... 111

Chapter 7: Using Windows Azure HDInsight Emulator ..... 113
    Installing the Emulator ........................... 114
    Verifying the Installation ........................ 116
    Using the Emulator ................................ 124
    Future Directions ................................. 125
    Summary ........................................... 125

Chapter 8: Accessing HDInsight over Hive and ODBC ..... 127
    Hive: The Hadoop Data Warehouse ................... 127
    Working with Hive ................................. 129
        Creating Hive Tables .......................... 129
        Loading Data .................................. 134
        Querying Tables ............................... 135
    Hive Storage ...................................... 137
    The Hive ODBC Driver .............................. 137
        Installing the Driver .........................
320. st

[Figure 6-11 shows the Hadoop Map/Reduce Administration portal: State RUNNING, started Tue Dec 10 02:47:00 GMT 2013, version 1.2.0.1.3.1.0-06, identifier 201312100246, SafeMode OFF, and a Cluster Summary (heap size 616.38 MB / 3.56 GB) with running map and reduce tasks, occupied and reserved map and reduce slots, total task capacity, job submissions, and average tasks per node.]

Figure 6-11. The MapReduce Status portal

You can scroll down to see the list of completed jobs, running jobs (which would populate only if a job is running at that point), failed jobs, and retired jobs. You can click on any of the job records to view more details about that specific operation, as shown in Figure 6-12.

[Figure 6-12 shows the details of a succeeded job: started Tue Dec 10 03:53:54 GMT 2013, finished Tue Dec 10 03:54:19 GMT 2013 (25 seconds), job cleanup successful, job scheduling information, a task table with map and reduce completion percentages, and counters such as SLOTS_MILLIS_MAPS (22,047) and total time spent by reduce tasks.]
321. stance of an object. (System.NullReferenceException)

[Figure 12-6 shows the IntelliTrace calls view with the debugger stopped at the exception: "Debugger: Stopped at Exception" and "Debugger: Exception Intercepted" in CreateCluster, Program.cs, line 64.]

Figure 12-6. IntelliTrace calls view

Note that once you are in the calls view, the link in the IntelliTrace window toggles to IntelliTrace Events View. IntelliTrace can greatly improve both your day-to-day development activities and your ability to quickly and easily diagnose problems, without having to restart your application and debug with the traditional break-step-inspect technique. This is just a brief overview of the feature. If you are interested, you can get more information about IntelliTrace at the following MSDN link: http://msdn.microsoft.com/en-us/library/vstudio/dd286579.aspx
322. stants.azureClusterUri, Constants.clusterUser, Constants.clusterPassword);

        IList<ClusterInfo> clusterInfos = client.GetClusters();
        ClusterInfo clusterInfo = clusterInfos[0];
        Console.WriteLine("Cluster Href: {0}", clusterInfo.Href);

        Regex clusterNameRegEx = new Regex(@"(\w+)");
        var clusterName = clusterNameRegEx.Match(Constants.azureClusterUri.Authority).Groups[1].Value;

        HostComponentMetric hostComponentMetric = client.GetHostComponentMetric(clusterName + ".azurehdinsight.net");
        Console.WriteLine("Cluster Map/Reduce Metrics:");
        Console.WriteLine("\tMaps Completed:\t{0}", hostComponentMetric.MapsCompleted);
        Console.WriteLine("\tMaps Failed:\t{0}", hostComponentMetric.MapsFailed);
        Console.WriteLine("\tMaps Killed:\t{0}", hostComponentMetric.MapsKilled);
        Console.WriteLine("\tMaps Launched:\t{0}", hostComponentMetric.MapsLaunched);
        Console.WriteLine("\tMaps Running:\t{0}", hostComponentMetric.MapsRunning);
        Console.WriteLine("\tMaps Waiting:\t{0}", hostComponentMetric.MapsWaiting);
    }

    // Helper function to wait while the job executes
    private static void WaitForJobCompletion(JobCreationResults jobResults, IJobSubmissionClient client)
    {
        JobDetails jobInProgress = client.GetJob(jobResults.JobId);
        while (jobInProgress.StatusCode != JobStatusCode.Completed && jobInProgress.StatusCode != JobStatusCode.Failed)
        {
            jobInProgress = client.GetJob(jobInProgress.JobId);
            Thread.Sleep(TimeSpan.FromSeconds(1));
        }
        Conso
323. torage.java:222)

I am running all the services for my demo cluster on the name node itself. My set of Hadoop service log files for cluster version 2.1 looks like those shown in Figure 11-2.

[Figure 11-2 shows the C:\apps\dist\hadoop-1.2.0.1.3.1.0-06\bin folder containing jobtracker.trace.log, namenode.trace.log, and secondarynamenode.trace.log, all last modified 12/10/2013 2:47 AM.]

Figure 11-2. Hadoop service log files

The service log files are common for all the services listed in Table 11-1. That means that each of the service-based projects, like Hive and so on, has these sets of service log files in its respective bin folder.

Hadoop log4j Log Files

When you consider that HDInsight is essentially a wrapper on top of core Hadoop, it is no surprise that it continues to embrace and support the traditional logging mechanism by Apache. You should continue to investigate these log files for most of your job failures, authentication issues, and service communication issues. In the HDInsight distribution on Azure, these logs are available in the C:\apps\dist\hadoop-1.2.0.1.3.1.0-06\logs directory of the respective nodes for Hadoop. By default, the log files are recycled daily at midnight.
324. t=750, maxJobsToAccept=7500, maxActiveTasks=200000, maxJobsPerUserToInit=750, maxJobsPerUserToAccept=7500, maxActiveTasksPerUser=100000

    2013-11-24 06:35:19,367 INFO org.apache.hadoop.mapred.CapacityTaskScheduler: Initializing 'default' queue with cap=75.0, maxCap=-1.0, ulMin=100, ulMinFactor=100.0, supportsPriorities=false, maxJobsToInit=2250, maxJobsToAccept=22500, maxActiveTasks=200000, maxJobsPerUserToInit=2250, maxJobsPerUserToAccept=22500, maxActiveTasksPerUser=100000
    2013-11-24 07:05:16,099 INFO org.apache.hadoop.mapred.JobTracker: jobToken generated and stored with users keys in /mapred/system/job_201311240635_0001/jobToken
    2013-11-24 07:05:16,796 INFO org.apache.hadoop.mapred.JobInProgress: job_201311240635_0001: nMaps=1 nReduces=0 max=-1
    2013-11-24 07:05:16,799 INFO org.apache.hadoop.mapred.JobQueuesManager: Job job_201311240635_0001 submitted to queue joblauncher
    2013-11-24 07:05:16,800 INFO org.apache.hadoop.mapred.JobTracker: Job job_201311240635_0001 added successfully for user 'admin' to queue 'joblauncher'
    2013-11-24 07:05:16,803 INFO org.apache.hadoop.mapred.AuditLogger: USER=admin IP=XX.XX.XX.XX OPERATION=SUBMIT_JOB TARGET=job_201311240635_0001 RESULT=SUCCESS
    2013-11-24 07:05:19,329 INFO org.apache.hadoop.mapred.JobInitializationPoller: Passing to Initializer Job Id: job_201311240635_0001 User: admin Queue: joblauncher
    2013-11-24 07:05:24,324 INFO org.apache.hadoop.ma
325. t the ODBC endpoint that the HDInsight service exposes for client applications. Once you install and configure the ODBC driver correctly, you can consume the Hive service running on HDInsight from any ODBC-compliant client application. This chapter takes you through the download, installation, and configuration of the driver, all the way to a successful connection to HDInsight.

Chapter 9, "Consuming HDInsight from Self-Service BI Tools," is a particularly interesting chapter for readers who have a BI background. This chapter introduces some of the present-day self-service BI tools that can be set up with HDInsight within a few clicks. With data visualization being the end goal of any data-processing framework, this chapter gets you going with creating interactive reports in just a few minutes.

Chapter 10, "Integrating HDInsight with SQL Server Integration Services," covers the integration of HDInsight with SQL Server Integration Services (SSIS). SSIS is a component of the SQL Server BI suite and plays an important part in data processing engines as a data extract, transform, and load tool. This chapter guides you through creating an SSIS package that moves data from Hive to SQL Server.

Chapter 11, "Logging in HDInsight," describes the logging mechanism in HDInsight. There is built-in logging in Apache Hadoop; on top of that, HDInsight implements its own logging framework. This chapter enables readers to learn abo
326. t will scan only relevant partitions in a filtered query.

You can use the SKEWED BY clause to create separate files for each row where a specified column value is in a list of specified values. Rows with values not listed are stored in a single "other" file. You can use the CLUSTERED BY clause to distribute data across a specified number of subfolders (described as buckets), based on the values of specified columns, using a hashing algorithm.

There are a few ways to execute Hive queries against your HDInsight cluster:

• Using the Hadoop Command Line
• Using the .NET SDK
• Using Windows Azure PowerShell

In this chapter, we use Windows Azure PowerShell to create, populate, and query Hive tables. The Hive tables are based on some demo stock data of different companies, as specified here:

• Apple
• Facebook
• Google
• MSFT
• IBM
• Oracle

Let's first load the input files to the WASB store that our democluster is using by executing the PowerShell script in Listing 8-1. The input files used in this book are just a subset of the stock market dataset available for free at www.infochimps.com and are provided separately.

Listing 8-1. Uploading files to WASB

    $subscriptionName = "<YourSubscriptionname>"
    $storageAccountName = "democluster"
    $containerName = "democlustercontainer"

    # This path may vary depending on where you place the source csv files
    $fileName = "D:\HDIDemoLab
327. tStorageContainerName democluster -UserName admin -Password Trek -ClusterSizeInNodes 2

Your Windows Azure Management Portal will soon display the progress of your cluster provisioning, as shown in Figure 4-11.

[Figure 4-11 shows the portal cluster list with datadork and democluster running and the new AutomatedHDI cluster being configured.]

Figure 4-11. Cluster provisioning in progress

On completion of the cluster creation, you will see the PowerShell prompt displaying the details of the newly created cluster:

    PS C:\> New-AzureHDInsightCluster -SubscriptionId $subid -Certificate $cert -Name AutomatedHDI -Location "East US" -DefaultStorageAccountName hdinsightstorage.blob.core.windows.net -DefaultStorageAccountKey $key1 -DefaultStorageContainerName democluster -UserName admin -Password ********** -ClusterSizeInNodes 2

    Name               : AutomatedHDI
    ConnectionUrl      : https://AutomatedHDI.azurehdinsight.net
    State              : Running
    CreateDate         : 9/8/2013 3:34:07 AM
    UserName           : admin
    Location           : East US
    ClusterSizeInNodes : 2

If there is an error in the specified command, the PowerShell console will show you the error messages. For example, if the supplied cluster password does not meet the password compliance policy, you will see an error message similar to the following while trying to provision a new cluster:

    New-AzureHDInsightCluster : Unable to complete the Create operation. Operation fai
328. te a report from the data.

• Use SQL Server Integration Services (SSIS) to transfer, and if required transform, HDInsight results to a database or file location for reporting. If the results are exposed as Hive tables, you can use an ODBC data source in an SSIS data flow to consume them. Alternatively, you can create an SSIS control flow that downloads the output files generated by HDInsight and uses them as a source for a data flow.

Summary

In this chapter, you saw the different aspects and trends regarding data processing and analytics. Microsoft HDInsight is a collaborative effort with the Apache open source community toward making Apache Hadoop an enterprise-class computing framework that will operate seamlessly regardless of platform and operating system. Porting the Hadoop ecosystem to Windows and combining it with the powerful SQL Server Business Intelligence suite of products opens up different dimensions in data analytics. However, it's incorrect to assume that HDInsight will replace existing database technologies. Instead, it likely will be a perfect complement to those technologies in scenarios that existing RDBMS solutions fail to address.

CHAPTER 2

Understanding Windows Azure HDInsight Service

Implementing a Big Data solution is cumbersome and involves significant deployment cost and effort at the beginning to set up the entire ecosystem. It can be a tricky decision for any company to invest such a huge amount of
329. that folder, you should have a file named job_201311120315_0003_conf.xml. The content of that file gives information about all the environment variables and configuration details for that MapReduce job.

The TaskTracker logs come into play when the Hive queries are through the physical plan generation phase and into the MapReduce phase. From that point forward, TaskTracker logs will have a detailed tracing of the operations performed. Note that the individual tasks are executed on the data nodes; hence, the TaskTracker logs are available in the data nodes only.

The NameNode maintains the log files for the JobTracker service in the same C:\apps\dist\hadoop-1.2.0.1.3.0.1-0302\logs folder. The JobTracker service is responsible for determining the location of the data blocks, and for maintaining coordination with, and monitoring of, the TaskTracker services running on different data nodes. The file name is hadoop-jobtracker-<node name>.log. You can open the file, and its contents should be similar to Listing 13-12.

Listing 13-12. The JobTracker Log

    2013-11-16 17:28:29,781 INFO org.apache.hadoop.mapred.JobTracker: Initializing job_201311120315_0003
    2013-11-16 17:28:29,781 INFO org.apache.hadoop.mapred.JobInProgress: Initializing job_201311120315_0003
    2013-11-16 17:28:29,952 INFO org.apache.hadoop.mapred.JobInProgress: Input size for job job_201311120315_0003 = 5015508. Number of splits = 1
    2013-11-16 1
330. the JobTracker log just after a MapReduce job is started.

Listing 13-6. Hadoop JobTracker Log

    2013-11-24 06:35:12,972 INFO org.apache.hadoop.mapred.JobTracker: STARTUP_MSG:
    /************************************************************
    STARTUP_MSG: Starting JobTracker
    STARTUP_MSG:   host = RD00155XXXXXX/XXX.XX.XX.XX
    STARTUP_MSG:   args = []
    STARTUP_MSG:   version = 1.2.0.1.3.1.0-06
    STARTUP_MSG:   build = git@github.com:hortonworks/hadoop-monarch.git on branch (no branch) -r 4cb3bb77cf3cc20c863de73bd6ef21cf069f66f; compiled by 'jenkins' on Wed Oct 02 21:38:25 Coordinated Universal Time 2013
    STARTUP_MSG:   java = 1.7.0_internal
    ************************************************************/
    2013-11-24 06:35:13,925 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists!
    2013-11-24 06:35:13,925 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens
    2013-11-24 06:35:13,940 INFO org.apache.hadoop.mapred.JobTracker: Scheduler configured with (memSizeForMapSlotOnJT, memSizeForReduceSlotOnJT, limitMaxMemForMapTasks, limitMaxMemForReduceTasks) (-1, -1, -1, -1)
    2013-11-24 06:35:14,347 INFO org.apache.hadoop.http.HttpServer: listener.getLocalPort() returned 50030 webServer.getConnectors()[0].getLocalPort() returned 50030
    2013-1
the Hadoop text input is processed, and each input line is passed into the Map function, which parses and filters the key/value pair for the data. The values are then sorted and merged by Hadoop. The processed, mapped data is then passed into the Reduce function as a key and a corresponding sequence of strings, which then defines the optional output value. One important thing to keep in mind is that Hadoop Streaming is based on text data. Thus, the inputs into the MapReduce are strings or UTF-8 encoded bytes. Strings are not always suitable for the operations you are performing, but the operations do need to be representable as strings.

Adding the References

Open the C# console application HadoopClient that you created in the previous chapter. Once the solution is opened, open the NuGet Package Manager Console and import the MapReduce NuGet package by running the following command:

install-package Microsoft.Hadoop.MapReduce

This should import the required .dll along with any dependencies it may have. You will see output similar to the following:

PM> install-package Microsoft.Hadoop.MapReduce
Attempting to resolve dependency 'Newtonsoft.Json (>= 4.5.11)'.
Installing 'Newtonsoft.Json 4.5.11'.
Successfully installed 'Newtonsoft.Json 4.5.11'.
Installing 'Microsoft.Hadoop.MapReduce 0.9.4951.25594'.
Successfully installed 'Microsoft.Hadoop.MapReduce 0.9.4951.25594'.
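To make the text-based contract concrete, here is a minimal sketch of a mapper and reducer written against the Microsoft.Hadoop.MapReduce package just installed. The base classes and the EmitKeyValue method follow the 0.9-era SDK surface; treat the exact signatures as assumptions to verify against the version NuGet pulls down for you.

using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;

// Emits one key/value pair per word. Both key and value travel as
// strings, which is the Hadoop Streaming contract described above.
public class WordMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        foreach (var word in inputLine.Split(' '))
        {
            if (word.Length > 0)
                context.EmitKeyValue(word, "1");
        }
    }
}

// Receives a key plus the sequence of string values produced for it
// and emits the aggregated count, again as a string.
public class WordReducer : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values,
        ReducerCombinerContext context)
    {
        context.EmitKeyValue(key, values.Count().ToString());
    }
}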
the Windows Azure Storage Blob independent of your cluster. It is very important to react and take corrective action quickly when there is a job failure. This chapter focused on the different types of jobs you can submit to your cluster and how to troubleshoot job failures. The chapter also covered some of the key Azure storage-related settings that could come in handy while troubleshooting an error or a performance problem, as well as the steps to diagnose connectivity failures to your cluster using the Hive ODBC driver.
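As a quick illustration of the kind of connectivity check described above, the following sketch opens a Hive ODBC data source from C# and runs a trivial query. The DSN name is a placeholder for whatever you configured in the ODBC Data Source Administrator; an exception on Open() is your first diagnostic signal.

using System;
using System.Data.Odbc;

class HiveOdbcSmokeTest
{
    static void Main()
    {
        // "Sample Microsoft Hive DSN" is assumed to exist; create it
        // first via the ODBC Data Source Administrator.
        using (var conn = new OdbcConnection("DSN=Sample Microsoft Hive DSN;"))
        {
            conn.Open(); // throws if the driver or cluster is misconfigured
            using (var cmd = new OdbcCommand("SHOW TABLES", conn))
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine(reader.GetString(0));
            }
        }
    }
}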
SQL Server Integration Services (SSIS), as discussed in Chapter 10. Unlike SSIS, Pig does not have a control flow system. Pig is written in Java and produces Java .jar code to run MapReduce jobs across the nodes in the Hadoop cluster to manipulate the data in a distributed way. Pig exposes a command-line shell called Grunt to execute Pig statements. To launch the Grunt shell, navigate to the c:\apps\dist\pig-0.11.0.1.3.1.0-06\bin directory from the Hadoop Command Line. Then execute the Pig command. That should launch the Grunt shell, as shown in Listing 6-12.

Listing 6-12. Launching the Pig Grunt shell

c:\apps\dist\pig-0.11.0.1.3.1.0-06\bin>pig
2013-12-10 01:48:10,150 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.0.1.3.1.0-06 (unknown) compiled Oct 02 2013, 21:58:30
2013-12-10 01:48:10,151 [main] INFO org.apache.pig.Main - Logging error messages to: C:\apps\dist\hadoop-1.2.0.1.3.1.0-06\logs\pig_1386640090147.log
2013-12-10 01:48:10,194 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file D:\Users\hadoopuser\.pigbootup not found
2013-12-10 01:48:10,513 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: wasb://democlustercontainer@democluster.blob.core.windows.net
2013-12-10 01:48:11,279 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: jobtrackerh
to choosing WASB over HDFS:

• WASB storage incorporates all the HDFS features, like fault tolerance, geo-replication, and partitioning.

• If you use WASB, you disconnect the data and compute nodes. That is not possible with Hadoop and HDFS, where each node is both a data node and a compute node. This means that, if you are not running large jobs, you can reduce the cluster's size and just keep the storage, probably at a reduced cost.

• You can spin up your Hadoop cluster only when needed, and you can use it as a transient compute cluster instead of as permanent storage. It is not always the case that you want to run idle compute clusters to store data. In most cases, it is more advantageous to create the compute resources on demand, process data, and then de-allocate them without losing your data. You cannot do that in HDFS, but it is already done for you if you use WASB.

• You can spin up multiple Hadoop clusters that crunch the same set of data stored in a common blob location. In doing so, you essentially leverage Azure blob storage as a shared data store; a sketch of reading such a shared container from .NET follows this list.

• Storage costs have been benchmarked at approximately five times lower for WASB than for HDFS.

• HDInsight has added significant enhancements to improve read/write performance when running MapReduce jobs on the data from the Azure blob store.

• You can process data directly without importing it to HDFS first. Many people already on a cloud infrastructure have existing pipelines
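The shared-data-store point is easy to demonstrate: any client with the storage key can enumerate the same container your clusters read. A minimal sketch with the Microsoft.WindowsAzure.Storage client library of that era follows; the account name, key, and container are placeholders.

using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class BlobInventory
{
    static void Main()
    {
        // Placeholder credentials; substitute your storage account details.
        var account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>");
        var container = account.CreateCloudBlobClient()
                               .GetContainerReference("democlustercontainer");

        // A flat listing walks every blob in the container, regardless of
        // which cluster (if any) wrote it.
        foreach (IListBlobItem item in container.ListBlobs(null, true))
            Console.WriteLine(item.Uri);
    }
}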
to the file C:\windows\system32\odbctrac.dll. Once tracing is started, all subsequent ODBC function calls will be recorded in the log file on your local machine. Sample ODBC log file entries look similar to the following snippet:

test 1c4-186c ENTER SQLAllocEnv
    HENV * 0x500BC504
test 1c4-186c EXIT SQLAllocEnv with return code 0 (SQL_SUCCESS)
    HENV * 0x500BC504 (0x008B9788)
test 1c4-186c ENTER SQLAllocEnv
    HENV * 0x500BC508
test 1c4-186c EXIT SQLAllocEnv with return code 0 (SQL_SUCCESS)
    HENV * 0x500BC508 (0x008B9808)
test 1c4-186c ENTER SQLSetEnvAttr
    SQLHENV 0x008B9808
    SQLINTEGER 201 <SQL_ATTR_CONNECTION_POOLING>
    SQLPOINTER 0 <SQL_CP_OFF>
    SQLINTEGER -6
test 1c4-186c EXIT SQLSetEnvAttr with return code 0 (SQL_SUCCESS)
    SQLHENV 0x008B9808
    SQLINTEGER 201 <SQL_ATTR_CONNECTION_POOLING>
    SQLPOINTER 0 <SQL_CP_OFF>
    SQLINTEGER -6
test 1c4-186c ENTER SQLAllocConnect
    HENV 0x008B9808
    HDBC * 0x004CAAB8
test 1c4-186c EXIT SQLAllocConnect with return code 0 (SQL_SUCCESS)
    HENV 0x008B9808
    HDBC * 0x004CAAB8 (0x008BA108)
test 1c4-186c ENTER SQLGetInfoW
    HDBC 0x008BA108
    UWORD 10 <SQL_ODBC_VER>
    PTR 0x004CAA84
    SWORD 22
    SWORD * 0x00000000
test 1c4-186c EXIT SQLGetInfoW with return code 0 (SQL_SUCCESS)
    HDBC 0x008BA108
    UWORD 10 <SQL_ODBC_VER>
    PTR 0x004CAA84 "03.80.0000"
    SWORD 22
    SWORD * 0x000000
to-track-storage-requests.aspx.

Logging in Windows Azure HDInsight Emulator

Windows Azure HDInsight Emulator is a single-node distribution of HDInsight available on Windows Server platforms. The logging mechanism on the emulator is almost exactly the same as in the Azure service; there are only some minor changes to the log file paths to worry about. Basically, everything remains the same. The only real change is that the base directory changes to C:\Hadoop, as opposed to the C:\apps\dist used in Azure. Also, since the emulator deploys HDInsight cluster version 1.6 as of this writing, the directory names of each of the projects also change. Figure 11-9 shows the directory structure of the emulator installation as of the writing of this book. There is every possibility that the emulator will match the Azure HDInsight cluster versions in the near future, and that everything will eventually be in sync.

Figure 11-9. The emulator installation directory under C:\Hadoop, with folders such as GettingStarted, hadoop-1.1.0-SNAPSHOT, hcatalog-0.4.1, HDFS, hive-0.9.0, and java
var store = new X509Store();
store.Open(OpenFlags.ReadOnly);
var cert = store.Certificates.Cast<X509Certificate2>()
    .First(item => item.Thumbprint == Constants.thumbprint);
var creds = new JobSubmissionCertificateCredential(Constants.subscriptionId, cert, Constants.clusterName);
var jobClient = JobSubmissionClientFactory.Connect(creds);
JobCreationResults jobResults = jobClient.CreateHiveJob(hiveJobDefinition);
Console.Write("Executing Hive Job.");

// Wait for the job to complete
WaitForJobCompletion(jobResults, jobClient);

// Print the Hive job output
System.IO.Stream stream = jobClient.GetJobOutput(jobResults.JobId);
System.IO.StreamReader reader = new System.IO.StreamReader(stream);
Console.Write("Done! List of Tables are:\n");
Console.WriteLine(reader.ReadToEnd());

Once this is done, you are ready to submit the Hive job to your cluster.

Running the Hive Job

The final step is to add a call to the DoHiveOperations method in the Main function. The Main method should now look similar to the following:

static void Main(string[] args)
{
    ListClusters();
    CreateCluster();
    DeleteCluster();
    DoCustomMapReduce();
    DoMapReduce();
    DoHiveOperations();
    Console.Write("Press any key to exit.");
    Console.ReadKey();
}

Note: You may need to comment out a few of the other function calls to avoid repetitive operations. Execute the code, and you should see the list of Hive tables printed to the console.
traditional Hadoop logging mechanism. Hadoop uses the Apache Log4j framework for logging, which is basically a logging package for Java. This logging framework not only logs operational information, it also gives you the control to tune the different levels of logging as required (for example, errors or warnings), along with several instrumentation options like log recycling, maintaining log history, and so on. This chapter will talk about a few key Log4j properties, but for a detailed understanding of the Log4j framework, you can visit the Apache site: http://logging.apache.org/log4j/2.x/manual/index.html.

Service Logs

Hadoop daemons are replaced by Windows Services in the HDInsight distribution. Different services run on different nodes of the cluster based on the role they play. You need to make a Remote Desktop connection to the nodes to access their respective log files.

Service Trace Logs

The service startup logs are located in the C:\apps\dist\hadoop-1.2.0.1.3.1.0-06\bin directory for the Hadoop services. Similarly, other service-based projects in the ecosystem, like Hive, Oozie, and so on, log their service startup operations in their respective bin folders. These files are marked with trace log extensions, and they are created and written to during the startup of the services. Table 11-1 summarizes the different types of trace log files available for the projects shipped in the current distribution of HDInsight on Azure.
Contents

Testing the DSN
Connecting to the HDInsight Emulator
Configuring a DSN-less Connection
Summary

Chapter 9: Consuming HDInsight from Self-Service BI Tools
    PowerPivot for Excel
    Creating a Stock Report
    Power View for Excel
    Power BI
    Summary

Chapter 10: Integrating HDInsight with SQL Server Integration Services
    Creating the Project
    Creating the Data Flow
    Creating the Source Hive Connection
    Creating the Destination SQL Connection
    Creating the Hive Source Component
    Creating the SQL Destination Component
    Mapping the Columns
    Running the Package
you see might change as new versions of the SDK are released. You will find that the references to the respective .dll files have been added to your solution, as shown in Figure 4-4.

Figure 4-4. The HadoopClient solution, with references to Microsoft.CSharp, Microsoft.WindowsAzure.Management.Framework, Microsoft.WindowsAzure.Management.HDInsight, System, and System.Core

Connecting to Your Subscription

The first step toward consuming your Azure services from any client application is to upload a management certificate to Azure. This certificate will subsequently be used by the client applications to validate themselves while connecting to and using the Azure services. For more information about how to create and upload a management certificate, see the "Create a Certificate" section at the following link: http://msdn.microsoft.com/en-us/library/windowsazure/gg981929.aspx

The HDInsight management package (Microsoft.WindowsAzure.Management.HDInsight) provides you with the .NET APIs to automate operations such as creating a cluster, listing existing clusters, and dropping them. The first thing that needs to be done, however, is providing the client applications with your Azure subscription certificate and its thumbprint. The standard .NET X509 set of classes can be used to query the Azure certificate store. But before that, you will need to generate a unique thumbprint.
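Because this store lookup recurs throughout the book's samples, it is convenient to pull it into a small helper. The sketch below assumes the personal store of the current user, which is what the parameterless X509Store constructor used in the inline samples implies.

using System.Linq;
using System.Security.Cryptography.X509Certificates;

static class CertificateHelper
{
    // Looks up the management certificate by thumbprint in the current
    // user's personal store, mirroring the inline pattern in this chapter.
    public static X509Certificate2 GetManagementCertificate(string thumbprint)
    {
        var store = new X509Store(StoreName.My, StoreLocation.CurrentUser);
        store.Open(OpenFlags.ReadOnly);
        try
        {
            return store.Certificates
                        .Cast<X509Certificate2>()
                        .First(c => c.Thumbprint == thumbprint);
        }
        finally
        {
            store.Close();
        }
    }
}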
Figure 6-12 lists the job's counters, including the following:

Job Counters
    Total time spent by all reduces waiting after reserving slots (ms)   0   0   0
    Total time spent by all maps waiting after reserving slots (ms)      0   0   0
    Launched map tasks                                                   0   0   1
    SLOTS_MILLIS_REDUCES                                                 0   0   0
File Output Format Counters
    Bytes Written          0        0        0
File Input Format Counters
    Bytes Read             0        0        0
FileSystemCounters
    WASB_BYTES_READ        164      0        164
    FILE_BYTES_READ        462      0        462
    HDFS_BYTES_READ        45       0        45
    FILE_BYTES_WRITTEN     63,851   0        63,851
    WASB_BYTES_WRITTEN     238      0        238

Figure 6-12. MapReduce job statistics

The Hadoop MapReduce portal gives you a comprehensive summary of each of the submitted jobs. You can drill down into the stdout and stderr output of the jobs, so it is obvious that the portal is a great place to start troubleshooting a MapReduce job problem.

The Name Node Status Portal

The Hadoop Name Node Status web interface shows a cluster summary, including information about total and remaining capacity, the file system, and cluster health. The interface also gives the number of live, dead, and decommissioning nodes. The Name Node Status Portal is a Java web application that listens on port 50070. It can be launched from the URL http://<NameNode_IP_Address>:50070/dfshealth.jsp. Additionally, the Name Node Status Portal allows you to browse the HDFS (actually, WASB) namespace and view the contents of its files in the web browser. It also gives access to the name node's log files.
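If you want to verify the portal is reachable without opening a browser (for example, from a script run over a Remote Desktop session), a plain HTTP fetch of the same URL suffices. This is just a reachability probe sketch; the host name is a placeholder for your name node's address.

using System;
using System.Net;

class PortalCheck
{
    static void Main()
    {
        // 50070 is the port the status portal listens on, as noted above.
        const string url = "http://<NameNode_IP_Address>:50070/dfshealth.jsp";
        using (var client = new WebClient())
        {
            string html = client.DownloadString(url);
            Console.WriteLine("Portal responded with {0} characters of HTML.", html.Length);
        }
    }
}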
Much like all the other types of jobs, Pig jobs can also be submitted using a PowerShell script. Listing 6-14 shows the PowerShell script to execute the same Pig job.

Listing 6-14. The PowerShell Pig job

$subid = "<Your Subscription Id>"
$subName = "<Your Subscription Name>"
$clusterName = "democluster"

$QueryString = "LOGS = LOAD 'wasb:///example/data/sample.log';" +
    "LEVELS = foreach LOGS generate REGEX_EXTRACT($0, '(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)', 1) as LOGLEVEL;" +
    "FILTEREDLEVELS = FILTER LEVELS by LOGLEVEL is not null;" +
    "GROUPEDLEVELS = GROUP FILTEREDLEVELS by LOGLEVEL;" +
    "FREQUENCIES = foreach GROUPEDLEVELS generate group as LOGLEVEL, COUNT(FILTEREDLEVELS.LOGLEVEL) as COUNT;" +
    "RESULT = order FREQUENCIES by COUNT desc;" +
    "DUMP RESULT;"

$pigJobDefinition = New-AzureHDInsightPigJobDefinition -Query $QueryString -StatusFolder "PigJobs/PigJobStatus"

# Submit the Pig job to the cluster
$pigJob = Start-AzureHDInsightJob -Subscription $subid -Cluster $clusterName -JobDefinition $pigJobDefinition

# Wait for the job to complete
$pigJob | Wait-AzureHDInsightJob -Subscription $subid -WaitTimeoutInSeconds 3600

Using the Grunt shell in Pig is another way to bypass coding MapReduce jobs, which can be tedious and time-consuming. The HDInsight name node gives you the option to interactively run Pig commands from their respective command shells. Doing so is often
query the data by executing HiveQL SELECT statements against the tables. As with all data processing on HDInsight, HiveQL queries are implicitly executed as MapReduce jobs to generate the required results. HiveQL SELECT statements are similar to SQL queries, and they support common operations such as JOIN, UNION, and GROUP BY. For example, you can use the code in Listing 8-8 to filter by the stock_symbol column and also to return 10 rows for sampling, because you don't know how many rows you may have.

Listing 8-8. Querying data from a Hive table

$subscriptionName = "<YourSubscriptionName>"
$clusterName = "democluster"

Select-AzureSubscription -SubscriptionName $subscriptionName
Use-AzureHDInsightCluster $clusterName -Subscription (Get-AzureSubscription -Current).SubscriptionId

$querystring = "select * from stock_analysis where stock_symbol LIKE 'MSFT' LIMIT 10;"
Invoke-Hive -Query $querystring

You should see output similar to the following once the job execution completes:

Successfully connected to cluster democluster
Submitting Hive query
Started Hive query with jobDetails Id: job_201311240635_0014
Hive query completed Successfully

MSFT  2/8/2013    31.69  31.9   31.57  31.89  29121500  31.89  NASDAQ
MSFT  1/8/2013    32.06  32.09  31.6   31.67  42328400  31.67  NASDAQ
MSFT  31/07/2013  31.97  32.05  31.71  31.84  43898400  31.84  NASDAQ
MSFT  30/07/2013  31.78  32.12  31.55  31.85  45799500  31.85  NASDAQ
MSFT  29
required packages, as shown in Figure 4-2.

Figure 4-2. The NuGet Package Manager Console, launched from TOOLS > Library Package Manager > Package Manager Console in Visual Studio

Table 4-1 summarizes the NuGet packages currently available for HDInsight, with a brief description of each.

Table 4-1. HDInsight NuGet packages

Package Name                                  Function
Microsoft.WindowsAzure.Management.HDInsight   Set of APIs for HDInsight cluster management operations
Microsoft.Hadoop.WebClient                    Set of APIs to work with the Hadoop file system
Microsoft.Hadoop.Hive                         Set of APIs for Hive operations
Microsoft.Hadoop.MapReduce                    Set of APIs for MapReduce job submission and execution
Microsoft.Hadoop.Avro                         Set of APIs for data serialization, based on the Apache open source project Avro
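To give a flavor of the last package in the table, here is a minimal serialization sketch. AvroSerializer.Create<T> and the data-contract attributes reflect the early Microsoft.Hadoop.Avro releases; the StockTick type and its values are purely illustrative.

using System.IO;
using System.Runtime.Serialization;
using Microsoft.Hadoop.Avro;

[DataContract]
class StockTick
{
    [DataMember] public string Symbol { get; set; }
    [DataMember] public double Close { get; set; }
}

class AvroExample
{
    static void Main()
    {
        // The serializer is generated from the data contract above.
        var serializer = AvroSerializer.Create<StockTick>();
        using (var stream = new MemoryStream())
        {
            serializer.Serialize(stream,
                new StockTick { Symbol = "AAPL", Close = 469.45 });
        }
    }
}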
your Program.cs file with the code shown in Listing 5-8.

Listing 5-8. The WaitForJobCompletion method

private static void WaitForJobCompletion(JobCreationResults jobResults, IJobSubmissionClient client)
{
    JobDetails jobInProgress = client.GetJob(jobResults.JobId);
    while (jobInProgress.StatusCode != JobStatusCode.Completed
        && jobInProgress.StatusCode != JobStatusCode.Failed)
    {
        jobInProgress = client.GetJob(jobInProgress.JobId);
        Thread.Sleep(TimeSpan.FromSeconds(1));
        Console.Write(".");
    }
}

Then add the DoMapReduce function in your Program.cs file. This function will have the actual code to submit the wordcount job. The first step is to create the job definition and configure the input and output parameters for the job. This is done using the MapReduceJobCreateParameters class:

// Define the MapReduce job
MapReduceJobCreateParameters mrJobDefinition = new MapReduceJobCreateParameters()
{
    JarFile = "wasb:///example/jars/hadoop-examples.jar",
    ClassName = "wordcount"
};
mrJobDefinition.Arguments.Add("wasb:///example/data/gutenberg/davinci.txt");
mrJobDefinition.Arguments.Add("wasb:///example/data/WordCountOutput");

The next step, as usual, is to grab the correct certificate credentials based on the thumbprint:

var store = new X509Store();
store.Open(OpenFlags.ReadOnly);
var cert = store.Certificates.Cast<X509Certificate2>()
    .First(item => item.Thumbprint == Constants.thumbprint);
var creds = new JobSubmissionCertificateCredential(Constants.subscriptionId, cert, Constants.clusterName);
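From here, the submission follows the same pattern as the Hive job shown elsewhere in this chapter. A short sketch, assuming CreateMapReduceJob is the MapReduce counterpart of CreateHiveJob on the same Microsoft.Hadoop.Client surface:

// Connect with the credentials built above and submit the job definition.
var jobClient = JobSubmissionClientFactory.Connect(creds);
JobCreationResults jobResults = jobClient.CreateMapReduceJob(mrJobDefinition);
Console.Write("Executing WordCount job.");

// Poll until the job reaches Completed or Failed.
WaitForJobCompletion(jobResults, jobClient);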
your cloud services in one place.

Figure 2-1. The Windows Azure Management Portal, listing items such as mobile services, SQL databases (CRM, SyncHub, HDInsight), cloud services, storage accounts, and media services, along with their type, status, subscription, and location

The account and the subscription can be managed by the same individual or by different individuals or groups. In a corporate enrollment, an account owner might create multiple subscriptions to give members of the technical staff access to services. Because resource usage within an account is billed and reported per subscription, an organization can use subscriptions to track expenses for projects, departments, regional offices, and so forth. A detailed discussion of Windows Azure is outside the scope of this book. If you are interested, you should visit the Microsoft official site for Windows Azure: http://www.windowsazure.com/en-us.

Windows Azure HDInsight Service
use the Windows Azure Storage Emulator to emulate the Windows Azure Storage blob (WASB), table, and queue cloud services on your local machine. Doing so helps you get started with basic testing and evaluation locally, without incurring the cost associated with the cloud service. The Windows Azure Storage Emulator comes as a part of the Windows Azure SDK for .NET. This book, however, does not use the storage emulator; rather, it uses actual WASB as HDInsight storage. Detailed instructions on configuring the storage emulator to be used from the HDInsight Emulator can be found at http://www.windowsazure.com/en-us/documentation/articles/hdinsight-get-started-emulator (see the blob storage section).

The emulator also deploys the same set of Windows Services as the Azure service. You can open the Windows Services console from Start > Run > Services.msc to start, stop, and set the startup type of the Apache Hadoop services, as shown in Figure 7-8.

Figure 7-8. The Apache Hadoop Windows Services (derbyserver, historyserver, hiveserver, hiveserver2, hwi, jobtracker, metastore, namenode, oozieservice, secondarynamenode, and so on), all started with the Automatic startup type
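If you prefer to drive these services from code instead of the console, the standard ServiceController class works. The service name below is taken from the list in Figure 7-8, but whether the display name and the internal service name match on your installation is an assumption worth verifying in Services.msc.

using System;
using System.ServiceProcess; // add a reference to System.ServiceProcess.dll

class RestartNameNode
{
    static void Main()
    {
        // Service name assumed from the Figure 7-8 listing.
        var sc = new ServiceController("Apache Hadoop namenode");
        if (sc.Status == ServiceControllerStatus.Running)
        {
            sc.Stop();
            sc.WaitForStatus(ServiceControllerStatus.Stopped);
        }
        sc.Start();
        sc.WaitForStatus(ServiceControllerStatus.Running);
        Console.WriteLine("namenode is {0}", sc.Status);
    }
}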
cluster container, under the SampleTableQueryFolder, along with its logs folder.

Figure 5-3. Numbers.txt uploaded in blob

You are now ready to invoke the job executor from your Main method, using the ExecuteJob method. In your Program.cs file, add a function DoCustomMapReduce with code like that in Listing 5-6. Note that this chapter will be using several built-in .NET Framework classes for IO, threading, and so on. Make sure you have the following set of using statements in your Program.cs file:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Security.Cryptography.X509Certificates;
using Microsoft.WindowsAzure.Management.HDInsight;
using Microsoft.Hadoop.MapReduce;
using Microsoft.Hadoop.Client;
// For stream IO
using System.IO;
// For the Ambari monitoring client
using Microsoft.Hadoop.WebClient.AmbariClient;
using Microsoft.Hadoop.WebClient.AmbariClient.Contracts;
// For Regex
using System.Text.RegularExpressions;
// For threads
using System.Threading;
// For blob storage
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

Listing 5-6. DoCustomMapReduce method

public static void DoCustomMapReduce()
{
    Console.WriteLine("Starting MapReduce job...");
    // Log in remotely to your
about the log files for the different services and where to look if something goes wrong.

Chapter 12, "Troubleshooting Cluster Deployments," is about troubleshooting scenarios you might encounter during your cluster creation process. This chapter explains the different stages of a cluster deployment and the deployment logs on the Name Node, as well as offering some tips on troubleshooting C# and PowerShell based deployment scripts.

Chapter 13, "Troubleshooting Job Failures," explains the different ways of troubleshooting a MapReduce job execution failure. This chapter also speaks about troubleshooting performance issues you might encounter, such as when jobs are timing out, running out of memory, or running for too long. It also covers some best-practice scenarios.

Downloading the Code

The author provides code to go along with the examples in this book. You can download that example code from the book's catalog page on the Apress.com website. The URL to visit is http://www.apress.com/9781430260554. Scroll about halfway down the page. Then find and click the tab labeled Source Code/Downloads.

Contacting the Author

You can contact the author, Debarchan Sarkar, through his twitter handle @debarchans. You can also follow his Facebook group at https://www.facebook.com/groups/bigdatalearnings and his Facebook page on HDInsight at https://www.facebook.com/MicrosoftBigData.

CHAPTER 1

Introducing HDInsight
SQL Server table, updating existing rows based on an update key column, as well as invoking a stored procedure execution.

sqoop job: The job command enables you to save your import/export commands as a job for future re-use. The saved jobs remember the parameters that are specified during execution, and they are particularly useful when there is a need to run an import or export command repeatedly on a periodic basis.

sqoop version: To quickly check the version of Sqoop you are on, you can run the sqoop version command to print the installed version details on the console.

For example, assuming that you have a database called sqoopdemo deployed in SQL Azure that has a table called stock_analysis, you can execute the import command in Listing 6-8 to import that table's data into blob storage.

Listing 6-8. The Sqoop import command

sqoop import --connect "jdbc:sqlserver://<Server>.database.windows.net;username=debarchans@<Server>;password=<Password>;database=sqoopdemo" --table stock_analysis --target-dir /example/data/StockAnalysis --as-textfile -m 1

On successful execution of the import job, you will see output on the Sqoop console similar to Listing 6-9.

Listing 6-9. The Sqoop import output

Warning: $HBASE_HOME and $HBASE_VERSION not set.
Warning: $HBASE_HOME does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
13/12/10 01
the following basic components:

• Name Node: This is also called the Head Node of the cluster. Primarily, it holds the metadata for HDFS; that is, while data being processed is distributed across the nodes, the Name Node keeps track of each HDFS data block in the nodes. The Name Node is also responsible for maintaining heartbeat coordination with the data nodes to identify dead nodes, decommissioning nodes, and nodes in safe mode. The Name Node is the single point of failure in a Hadoop cluster.

• Data Node: Stores the actual HDFS data blocks. The data blocks are replicated on multiple nodes to provide fault-tolerant and highly available solutions.

• Job Tracker: Manages MapReduce jobs and distributes individual tasks.

• Task Tracker: Instantiates and monitors individual Map and Reduce tasks.

Additionally, there are a number of supporting projects for Hadoop, each having its own unique purpose; for example, to feed input data into the Hadoop system, to act as a data warehousing system for ad hoc queries on top of Hadoop, and many more. Here are a few specific examples worth mentioning:

• Hive: A supporting project for the main Apache Hadoop project. It is an abstraction on top of MapReduce that allows users to query the data without developing MapReduce applications. It provides the user with a SQL-like query language, called Hive Query Language (HQL), to fetch data from the Hive store.

• Pig: An alternative abstraction of MapReduce that uses a data flow scripting language called Pig Latin.
$wordCountJob = Start-AzureHDInsightJob -Cluster $cluster -JobDefinition $hdinsightJob -Credential $creds

# Wait for the job to complete
Wait-AzureHDInsightJob -Job $wordCountJob -WaitTimeoutInSeconds 3600 -Credential $creds

Note: When prompted for credentials, provide hadoop as the user name, and type in any text as the password. This is essentially a dummy credential prompt, which is needed to maintain compatibility with the Azure service from PowerShell scripts.

Future Directions

With hardware costs decreasing considerably over the years, organizations are leaning toward appliance-based data processing engines. An appliance is a combination of hardware units and built-in software programs suitable for a specific kind of workload. Though Microsoft has no plans to offer a multinode HDInsight solution for use on-premises, it does offer an appliance-based, multiunit, massively parallel processing (MPP) device called the Parallel Data Warehouse (PDW). Microsoft PDW gives you performance and scalability for data warehousing with the plug-and-play simplicity of an appliance. Some nodes in the appliance can run SQL PDW, and some nodes can run Hadoop (called a Hadoop Region). A new data processing technology called PolyBase has been introduced, which is designed to be the simplest way to combine nonrelational data and traditional relational data for your analysis. It acts as a bridge to a
writing, it deploys HDInsight components version 1.6. In all probability, the HDInsight Emulator will be upgraded soon to match the version of the Azure service, and both will have the same set of log files.

Service Wrapper Files

Apart from the startup logs, there are files called wrapper logs available for the HDInsight services. These files contain the startup command string used to start the service. They also record the process id when the service starts successfully. They carry the .wrapper.log extension and are available in the same directory where the .out log files reside. For example, if you open hiveserver.wrapper.log, you should see commands similar to the snippet below:

org.apache.hadoop.hive.service.HiveServer -hiveconf hive.hadoop.classpath=c:\apps\dist\hive-0.9.0\lib -hiveconf hive.metastore.local=true -hiveconf hive.server.servermode=http -p 10000 -hiveconf hive.querylog.location=c:\apps\dist\hive-0.9.0\logs\history -hiveconf hive.log.dir=c:\apps\dist\hive-0.9.0\logs
2013-08-11 16:40:45 - Started 4264

Note that the process id of the service is recorded at the end of the wrapper log. This is very helpful in troubleshooting scenarios where you may want to trace a specific process that has already started; for example, determining the heap memory usage of the name node process while troubleshooting an out-of-memory problem.

Service Error Files

The HDInsight version 1.6 services also generate an
successfully created, called Sample Microsoft Hive DSN.

Connecting to the HDInsight Emulator

There are a few differences between connecting to the Windows Azure HDInsight service and connecting to the single-node HDInsight Emulator on-premises using the ODBC driver. You are not required to provide any user name or password to connect to the emulator. The other two key differences between connecting to the emulator and connecting to the Azure service are:

• Port Number: The ODBC driver connects to the emulator using port 10001.

• Authentication Mechanism: The mechanism used is Windows Azure HDInsight Emulator.

Figure 8-11 shows the configuration screen of the ODBC DSN when connecting to the HDInsight Emulator.

Figure 8-11. Connecting to Windows Azure HDInsight Emulator. The Microsoft Hive ODBC Driver DSN Setup dialog shows Data Source Name: HadoopOnEmulator; Description: DSN to connect to Hive on HDInsight Emulator; Host: localhost; Port: 10001; Database: default; Hive Server Type: Hive Server 2; Authentication Mechanism: Windows Azure HDInsight Emulator

Configuring a DSN-less Connection

Using a DSN requires you to preregister the data source using the Windows ODBC Data Source Administrator. You'll then be able to reference this DSN entry by name. A DSN-less connection instead supplies the same settings inline, as sketched below.
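The sketch below moves the dialog settings from Figure 8-11 into the connection string itself. The exact keyword names (Driver, Host, Port, and so on) are assumptions to verify against the documentation of the Hive ODBC driver version you have installed.

using System;
using System.Data.Odbc;

class DsnLessConnection
{
    static void Main()
    {
        // Keyword names assumed; check them against your driver's docs.
        const string connStr =
            "Driver={Microsoft Hive ODBC Driver};" +
            "Host=localhost;Port=10001;" +   // emulator settings from Figure 8-11
            "Schema=default;HiveServerType=2;";

        using (var conn = new OdbcConnection(connStr))
        {
            conn.Open();
            Console.WriteLine("Connected without a DSN.");
        }
    }
}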
any failure that happens during a Pig job execution. A sample excerpt of such a trace is shown in Listing 13-16.

Listing 13-16. Pig Stack Trace

ERROR 1000: Error during parsing. Encountered <IDENTIFIER> "exit" at line 4, column 1.
Was expecting one of:
    <EOF>
    <EOL> ...

org.apache.pig.tools.pigscript.parser.ParseException: Encountered <IDENTIFIER> "exit" at line 4, column 1.
Was expecting one of:
    <EOF> ...

It is important to understand that, for each of these supporting projects, the underlying execution framework is still MapReduce. Thus, if a job failure occurs at the MapReduce phase, the JobTracker logs are the place to investigate.

Explain Command

The EXPLAIN command in Pig shows the logical and physical plans of the MapReduce jobs triggered by your Pig Latin statements. Following is the Pig statement we executed in Chapter 6 to aggregate and sort the output messages from the Sample.log file; we'll use it as the basis for an example. Launch the Pig command shell from the c:\apps\dist\pig-0.11.0.1.3.1.0-06\bin folder and type in the lines of script one after another:

LOGS = LOAD 'wasb:///example/data/sample.log';
LEVELS = foreach LOGS generate REGEX_EXTRACT($0, '(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)', 1) as LOGLEVEL;
FILTEREDLEVELS = FILTER LEVELS by LOGLEVEL is not null;

If you wish to display the logical, physical, and MapReduce execution plans for a relation, run the EXPLAIN command against it; a .NET-based variant is sketched below.
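You can also capture the same plans programmatically by sending an EXPLAIN statement through the .NET job-submission API. This sketch assumes the PigJobCreateParameters type in Microsoft.Hadoop.Client behaves like its Hive counterpart used earlier, and that creds is the JobSubmissionCertificateCredential built in Chapter 5; both are assumptions to verify against your SDK version.

// Batch the script plus an EXPLAIN on the relation of interest.
var pigJob = new PigJobCreateParameters()
{
    Query = "LOGS = LOAD 'wasb:///example/data/sample.log';" +
            "EXPLAIN LOGS;",
    StatusFolder = "PigJobs/ExplainStatus"
};

var jobClient = JobSubmissionClientFactory.Connect(creds);
var jobResults = jobClient.CreatePigJob(pigJob);
// The plans land in the job's stdout under the status folder.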
$key1 = (Get-AzureStorageKey hdinsightstorage).Primary
VERBOSE: 8:50:29 AM - Begin Operation: Get-AzureStorageKey
VERBOSE: 8:50:34 AM - Completed Operation: Get-AzureStorageKey

If you provide the wrong storage account name, or one that belongs to a different subscription, you might get error messages like the following ones while trying to acquire the storage account key:

PS C:\> $key1 = (Get-AzureStorageKey hdinsightstorage).Primary
VERBOSE: 1:30:18 PM - Begin Operation: Get-AzureStorageKey
Get-AzureStorageKey : An exception occurred when calling the ServiceManagement API. HTTP Status Code: 404. ServiceManagement Error Code: ResourceNotFound. Message: The storage account 'hdinsightstorage' was not found. Operation Tracking ID: 72c0c6bb12b94f849aa8884154655089.

Note: If you have multiple subscriptions, you can use Set-AzureSubscription -DefaultSubscription <Your_Subscription_Name> to set the default subscription in PowerShell to the one where your cluster storage accounts reside.

Now you have all the necessary information to spin up a new cluster using the cmdlet. The following snippet shows you the command with the required parameters to provision a new HDInsight cluster:

New-AzureHDInsightCluster -SubscriptionId $subid -Certificate $cert -Name AutomatedHDI -Location "East US" -DefaultStorageAccountName hdinsightstorage.blob.core.windows.net -DefaultStorageAccountKey $key1 -DefaultStorageContainerName <container name>
routinely incorporate components for certain functions. NuGet is a Visual Studio extension that makes it easy to add, remove, and update libraries and tools in Visual Studio projects that use the .NET Framework. When you add a library, NuGet copies files to your solution and automatically adds and updates the required references in your app.config or web.config file. NuGet also makes sure that it reverts those changes when the library is dereferenced from your project, so that nothing is left behind. For more detailed information, visit the NuGet documentation site: http://nuget.codeplex.com/documentation.

There are NuGet packages for HDInsight that need to be added to your solution. Starting with Visual Studio 2013, the version that I am using to build the samples for this book, NuGet is included in every edition (except Team Foundation Server) by default. If you are developing on a Visual Studio 2010 platform, or for some reason you cannot find it in Visual Studio 2013, you can download the extension from the following link: http://docs.nuget.org/docs/start-here/installing-nuget.

Once you download the extension, you will have a NuGet.Tools.vsix file, which is a Visual Studio extension. Execute the file, and the VSIX installer will install the Visual Studio add-in. Note that you will need to restart Visual Studio, if it is already running, after the add-in installation. This add-in will enable you to import the required packages.
Figure 7-6. The HDInsight Emulator installation directory under C:\Hadoop, with folders such as GettingStarted, hadoop-1.1.0-SNAPSHOT, hcatalog-0.4.1, HDFS, hive-0.9.0, java, oozie-3.2.0-incubating, pig-0.9.3-SNAPSHOT, sqoop-1.4.2, and templeton-0.1.4, along with scripts such as set-onebox-autostart.cmd/.ps1, set-onebox-manualstart.cmd/.ps1, singlenodecreds.xml, start-onebox.cmd/.ps1, and stop-onebox.cmd

Also, the logging infrastructure, along with the log files and paths, is exactly identical to what you see in the actual Azure service Name Node. Each of the project folders has its respective log directories.
Azure storage management. Although the release versions of this tool need to be purchased, there is still an older version available as freeware. That older version can be downloaded from the following URL: http://clumsyleaf.com/products/cloudxplorer.

Windows Azure Explorer: This is another Azure storage management utility, which offers both a freeware and a paid version. A 30-day trial of the paid version is available. It is a good idea to evaluate either the freeware version or the 30-day trial before making a purchase decision. You can grab this tool from the following page: http://www.cloudberrylab.com/free-microsoft-azure-explorer.aspx.

Apart from these utilities, there are a few programmatic interfaces that enable you to develop your own application to manage your storage blobs. Those interfaces are:

• AzCopy
• Windows Azure PowerShell
• Windows Azure Storage Client Library for .NET
• Hadoop command line

To get a complete understanding of how you can implement these programmatic interfaces and build your own data upload solution, check the link below; a small Storage Client Library sketch also follows.

http://www.windowsazure.com/en-us/manage/services/hdinsight/howto-upload-data-to-hdinsight

Windows Azure Flat Network Storage

Traditional Hadoop leverages the locality of data per node through HDFS to reduce data traffic and network bandwidth. On the other hand, HDInsight promotes the use of WASB as the source of data, thus providing a unified and more manageable platform for both storage and compute.
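To show what the Storage Client Library route from the list above looks like in practice, here is a minimal upload sketch. The connection string, container, blob path, and local file are placeholders; the API shape follows the Microsoft.WindowsAzure.Storage library of this era.

using System.IO;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class UploadToWasb
{
    static void Main()
    {
        // Placeholder credentials; substitute your storage account details.
        var account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>");
        var container = account.CreateCloudBlobClient()
                               .GetContainerReference("democlustercontainer");

        // The blob name doubles as the wasb:// path your jobs will read.
        CloudBlockBlob blob = container.GetBlockBlobReference("example/data/sample.log");
        using (var fileStream = File.OpenRead(@"C:\data\sample.log"))
        {
            blob.UploadFromStream(fileStream);
        }
    }
}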