import React from "react";

function ConfiguringSpark () {
    return (
        <section>
            <div className="ug-content-wrapper" data-content="Configuring Spark">
                <div className="ug-content-main-header">Configuring Spark</div>
                <div className="col-12 col-md-8 col-lg-9 ug-content">
                    <p>Spark settings are available only to customers using the DCP On-Premises edition.</p>
                    <p>DCP uses Apache Spark for data processing. For each version of DCP, a specific version of Apache Spark must be deployed on the DCP host.</p>
                    <p>The unifi-prereqs script, which is run during DCP installation, downloads the required version of Spark and installs it on the DCP host. The following additional configuration steps ensure that Spark works properly with DCP.</p>
                </div>
            </div>

            <div className="ug-content-wrapper" data-content="Configuring System Data Source Access for Spark">
                <div className="ug-content-main-header">Configuring System Data Source Access for Spark</div>
                <div className="col-12 col-md-8 col-lg-9 ug-content">
                    <p>Spark reads datasets located on the DCP System HDFS and System Hive data sources, so it must be configured with direct access to both. Complete the following steps:</p>
                    <ol>
                        <li>Set HADOOP_CONF_DIR to the directory that contains the System HDFS configuration files.</li>
                        <li>Copy hive-site.xml to the directory $SPARK_HOME/conf.</li>
                        <li>Copy the Hive metastore JDBC jar file to $SPARK_HOME/jars so that it is available to Spark at runtime.</li>
                        <li>Set the UNIFI_EXECUTOR_USER environment variable to the same username as the DCP System HDFS account. When the DCP Spark application runs a transform job, it then generates output as the System HDFS user, which ensures a high enough permission level on the output directory to export job output files.</li>
                    </ol>
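                    <p>As a rough sketch, the steps above might look like the following shell commands, run in the environment that starts DCP. Every path, the jar file name, and the user name hdfs are illustrative assumptions, not values taken from a DCP installation; substitute the ones that apply to your cluster.</p>
                    <p className={"font-weight-300"}>
                        # Example values only: adjust the paths, jar name, and user for your environment<br/>
                        export HADOOP_CONF_DIR=/etc/hadoop/conf<br/>
                        cp /etc/hive/conf/hive-site.xml $SPARK_HOME/conf/<br/>
                        cp /path/to/hive-metastore-jdbc-driver.jar $SPARK_HOME/jars/<br/>
                        export UNIFI_EXECUTOR_USER=hdfs
                    </p>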
                </div>
            </div>

            <div className="ug-content-wrapper" data-content="Choosing Local or YARN Execution">
                <div className="ug-content-main-header">Choosing Local or YARN Execution</div>
                <div className="col-12 col-md-8 col-lg-9 ug-content">
                    <p>DCP’s main software module, the executor service, runs as a Spark application. The executor service can run in local mode or YARN mode. In a production deployment, YARN mode is typically used.</p>
                    <p>To choose between local and YARN execution, use the <span className={"font-weight-300"}>--executor-mode</span> parameter on the DCP startup script. For example:</p>
                    <p className={"font-weight-300"}>unifi_start --executor-mode yarn</p>
                </div>
            </div>

            <div className="ug-content-wrapper" data-content="YARN mode">
                <div className="ug-content-main-header">YARN mode</div>
                <div className="col-12 col-md-8 col-lg-9 ug-content">
                    <p>In YARN mode, the DCP executor service runs as an Apache Hadoop YARN application on the System HDFS YARN cluster.</p>
                    <p>The cluster’s resource manager shows the running executor service as a YARN application. For example, in a Cloudera cluster, the resource manager console that lists the DCP executor service is typically available at http://&lt;namenodehost&gt;:8088/cluster.</p>
                    <p>In YARN mode, Spark executors run as separate processes on the data nodes of the Hadoop cluster. Consider the resource allocation settings in executor.conf carefully to optimize the running of both DCP and the Hadoop cluster. The optimal settings depend on the size of the cluster and on what it is used for aside from DCP. Resources allocated from the cluster are held throughout the lifetime of the DCP executor service.</p>
                    <p>DCP uses the Hadoop cluster for:</p>
                    <ul>
                        <li>Sqoop jobs</li>
                        <li>Hive jobs</li>
                        <li>Spark jobs</li>
                        <li>Export jobs</li>
                    </ul>
                </div>
            </div>

            <div className="ug-content-wrapper" data-content="Local mode">
                <div className="ug-content-main-header">Local mode</div>
                <div className="col-12 col-md-8 col-lg-9 ug-content">
                    <p>Local mode is not typically used in production.</p>
                    <p>In local mode, the DCP executor service runs on the DCP host and uses no cluster resources.</p>
                    <p>Use this mode with caution. Without the computing power provided by the System HDFS YARN cluster, performance can be affected.</p>
                    <p>In local mode, the resource allocation settings in executor.conf are less important. The DCP executor service, Spark driver, and Spark executors run in the same Java virtual machine (JVM) on the DCP host. Increasing the number of executors spawns more threads within the JVM to provide parallelism.</p>
                </div>
            </div>

            <div className="ug-content-wrapper" data-content="Allocating Resources for Spark">
                <div className="ug-content-main-header">Allocating Resources for Spark</div>
                <div className="col-12 col-md-8 col-lg-9 ug-content">
                    <p>Running the DCP application consumes computing and memory resources. By configuring this resource usage, you can optimize runtime performance and make sure that DCP and other applications share resources without starving one another.</p>
                    <p>The resource allocation settings are in the file:</p>
                    <p className={"font-weight-300"}>$UNIFI_HOME/services/executor/executor-service/executor.conf</p>
                </div>
            </div>

            <div className="ug-content-wrapper" data-content="Configuring Spark Context">
                <div className="ug-content-main-header">Configuring Spark Context</div>
                <div className="col-12 col-md-8 col-lg-9 ug-content">
                    <p>The Spark context for the DCP executor service is configured in the spark section of the file executor.conf.</p>
                    <p>Be sure to read and follow the recommendations in <a href={"https://docs.google.com/document/d/1Nk8VvYIiaPvR9xiWPxOlMJCxFlAFQfwrBPwhbxBnipI/edit#heading=h.i17xr6"} className={"link"} target={"_blank"} rel={"noopener noreferrer"}>Allocating Cluster Resources for Spark</a>.</p>
                </div>
            </div>

            <div className="ug-content-wrapper" data-content="Configuring Executor Service Jobs">
                <div className="ug-content-main-header">Configuring Executor Service Jobs</div>
                <div className="col-12 col-md-8 col-lg-9 ug-content">
                    <p>When a DCP user takes an action such as requesting full statistics, previewing data, or running a transform job, this starts an executor service job. The executor service job, in turn, triggers multiple Spark jobs to achieve the end result.</p>
                    <p>The number of Spark jobs triggered by each DCP executor service job depends to a great extent on internal DCP calculations. However, you can control some job concurrency parameters at the executor service level by using the jobs section in executor.conf. For example:</p>
                    <p className={"font-weight-300"}>
                        jobs &#123;<br/>
                            &nbsp;&nbsp;maxViewsInCache = 300<br/>
                            &nbsp;&nbsp;maxConcurrentSaveJobs = 3<br/>
                            &nbsp;&nbsp;maxConcurrentViewJobs = 20<br/>
                        &#125;
                    </p>

                    <p><b>The settings in the jobs section are:</b></p>
                    <ul>
                        <li>maxViewsInCache – The DCP executor service caches the results of preview jobs, statistics jobs, and so on in its memory. This parameter controls how many results DCP can cache. Increase or decrease this value depending on how much memory is allocated to the executor service. Default: 300.</li>
                        <li>maxConcurrentSaveJobs – Number of full statistics or transform jobs that can run in parallel. Default: 3.</li>
                        <li>maxConcurrentViewJobs – Number of preview jobs that can run in parallel. Default: 20.</li>
                    </ul>
                </div>
            </div>

            <div className="ug-content-wrapper" data-content="Scheduling Spark Jobs">
                <div className="ug-content-main-header">Scheduling Spark Jobs</div>
                <div className="col-12 col-md-8 col-lg-9 ug-content">
                    <p>When running DCP, multiple Spark jobs can be active simultaneously. Spark provides two options for managing parallel jobs:</p>
                    <ul>
                        <li>FIFO – First in, first out. Jobs are given priority in the order in which they are submitted.</li>
                        <li>Fair – All parallel jobs get a fair share of Spark resources on the cluster, so that long-running jobs do not block quicker jobs.</li>
                    </ul>
                    <p>For more information about FIFO and fair scheduling, see <a href={"http://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application"} className={"Link"} target={"_blank"} rel={"noopener noreferrer"}>Scheduling Within an Application</a> in the Apache Spark documentation.</p>
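                    <p>In Apache Spark itself, the scheduling mode within an application is selected by the spark.scheduler.mode property, which defaults to FIFO. The fragment below shows that property set to fair scheduling in Spark properties-file syntax. It is a sketch only: whether and where DCP exposes this property in executor.conf is an assumption to verify for your DCP version.</p>
                    <p className={"font-weight-300"}>spark.scheduler.mode FAIR</p>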
                </div>
            </div>

            <div className="ug-content-wrapper" data-content="Troubleshooting Spark Jobs">
                <div className="ug-content-main-header">Troubleshooting Spark Jobs</div>
                <div className="col-12 col-md-8 col-lg-9 ug-content">
                    <p>DCP runs as multiple Spark jobs, typically on a Hadoop YARN cluster (this depends on the DCP startup mode; see Choosing Local or YARN Execution for more information). When troubleshooting, you can check that the Spark application you triggered on YARN is running. Open the YARN resource manager console:</p>
                    <p>http://&lt;namenodehost&gt;:8088/cluster</p>
                    <p>You should see something like the following. The DCP Spark application is named unifi-jobs.</p>
                    <figure>
                        <img alt="Boomi Data Catalog & Prep" src={require('../assets/images/spark_1.png')} />
                    </figure>
                    <p>To see more details, click ApplicationMaster. This opens the Spark UI, which gives more information about the jobs that are running in the Spark context. The Spark UI shows whether DCP is waiting for a Spark job to complete. This information can help you troubleshoot DCP runtime delays.</p>
                    <figure>
                        <img alt="Boomi Data Catalog & Prep" src={require('../assets/images/spark_2.png')} />
                    </figure>
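                    <p>The same check can also be sketched from the command line. This assumes that the standard Hadoop yarn client is installed and configured for the System HDFS YARN cluster on the host where you run it, which may not be the case in your environment:</p>
                    <p className={"font-weight-300"}>yarn application -list -appStates RUNNING | grep unifi-jobs</p>
                    <p>A matching line suggests that the DCP Spark application is registered with YARN and running. No output suggests that it has not started or has already exited.</p>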
                </div>
            </div>
        </section>
    )
}

export default ConfiguringSpark;
