This section describes how settings related to running YARN applications can be modified.

All applications, whether they are stream apps or task apps, can be centrally configured with servers.yml, as that file is passed to the apps using --spring.config.location='servers.yml'.
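As a rough sketch (assuming a single-node Hadoop setup where HDFS and the resource manager run on localhost), a minimal servers.yml might contain little more than the Hadoop endpoints:

spring:
  hadoop:
    fsUri: hdfs://localhost:8020
    resourceManagerHost: localhost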
Stream and task processes for the application master and containers can be further tuned by setting memory and cpu settings. Java options also allow defining the actual JVM options.
spring:
  cloud:
    deployer:
      yarn:
        app:
          streamappmaster:
            memory: 512m
            virtualCores: 1
            javaOpts: "-Xms512m -Xmx512m"
          streamcontainer:
            priority: 5
            memory: 256m
            virtualCores: 1
            javaOpts: "-Xms64m -Xmx256m"
          taskappmaster:
            memory: 512m
            virtualCores: 1
            javaOpts: "-Xms512m -Xmx512m"
          taskcontainer:
            priority: 10
            memory: 256m
            virtualCores: 1
            javaOpts: "-Xms64m -Xmx256m"

The base directory where all needed files are kept defaults to /dataflow and can be changed using the baseDir property.
spring:
  cloud:
    deployer:
      yarn:
        app:
          baseDir: /dataflow

Spring Cloud Data Flow app registration is based on URIs with various different endpoints. As mentioned in Chapter 18, How YARN Deployment Works, all applications are first stored into HDFS before the application container is launched. The server can use http, file, and maven based URIs, as well as direct hdfs URIs.

It is possible to place these applications directly into HDFS and register an application based on that URI.
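For example, assuming an app artifact has already been copied into HDFS, it could be registered from the Data Flow shell with its hdfs URI (the path and jar name below are hypothetical):

dataflow:>app register --name time --type source --uri hdfs:///dataflow/artifacts/repo/time-source-kafka-1.0.0.jar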
Logging for all components is done centrally via the servers.yml file using normal Spring Boot properties.
logging:
  level:
    org.apache.hadoop: INFO
    org.springframework.yarn: INFO

The YARN NodeManager continuously tracks how much memory is used by individual YARN containers. If a container uses more memory than its configuration allows, it is simply killed by the NodeManager. The application master controlling the app lifecycle is given a little more freedom, meaning the NodeManager is not as aggressive when deciding whether that container should be killed.
![]() | Important |
|---|---|
These are global cluster settings and cannot be changed during an application deployment. |
Let's take a quick look at memory related settings in a YARN cluster and in YARN applications. The xml config below is what a default vanilla Apache Hadoop uses for memory related settings. Other distributions may have different defaults.
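As a reference sketch of those stock defaults (verify against your own yarn-site.xml), the relevant entries are roughly:

<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>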
yarn.nodemanager.vmem-check-enabled — Enables a check of virtual memory used by a container process. Usual reasons for disabling this check are an OS that aggressively allocates virtual memory for launched processes, easily exceeding the default ratio of 2.1, or bugs in an OS causing wrong calculation of a used virtual memory.

yarn.scheduler.minimum-allocation-mb — Defines a minimum allocated memory for container.
![]() | Note |
|---|---|
This setting also indirectly defines the actual physical memory limit requested during a container allocation. The actual physical memory limit is always going to be a multiple of this setting, rounded to the upper bound. For example, if this setting is left to its default of 1024 and a container is requested with 512M, an actual limit of 1024M is used; if the requested size is 1100M, the actual size is set to 2048M. |
Enabling kerberos is relatively easy when a kerberized cluster already exists. Just like with every other hadoop related service, use a specific user and a keytab.
spring:
  hadoop:
    security:
      userPrincipal: scdf/_HOST@HORTONWORKS.COM
      userKeytab: /etc/security/keytabs/scdf.service.keytab
      authMethod: kerberos
      namenodePrincipal: nn/_HOST@HORTONWORKS.COM
      rmManagerPrincipal: rm/_HOST@HORTONWORKS.COM
      jobHistoryPrincipal: jhs/_HOST@HORTONWORKS.COM

![]() | Note |
|---|---|
When using ambari, configuration and keytab generation are fully automated. |
![]() | Important |
|---|---|
Currently released kafka based apps don't work with a cluster where zookeeper and kafka itself are configured for kerberos authentication. The workaround is to use rabbit based apps or to build stream apps based on the new kafka binder, which has support for kerberized kafka. |
After a kafka based stream app has kerberos support, some settings in ambari's kafka configuration need to be changed. Effectively, listeners and security.inter.broker.protocol need to use SASL_PLAINTEXT. Also, the binder needs to be able to create topics, thus the scdf user needs to be added to kafka's super users.
listeners=SASL_PLAINTEXT://localhost:6667
security.inter.broker.protocol=SASL_PLAINTEXT
super.users=user:kafka;user:scdf
Additional configuration is needed for the binder and for the sasl config.
spring:
  cloud:
    stream:
      kafka:
        binder:
          configuration:
            security:
              protocol: SASL_PLAINTEXT

spring:
  cloud:
    deployer:
      yarn:
        app:
          streamcontainer:
            saslConfig: "-Djava.security.auth.login.config=/etc/scdf/conf/scdf_kafka_jaas.conf"

Where scdf_kafka_jaas.conf looks something like the example shown below.
KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="/etc/security/keytabs/scdf.service.keytab"
storeKey=true
useTicketCache=false
serviceName="kafka"
principal="scdf/sandbox.hortonworks.com@HORTONWORKS.COM";
};

![]() | Important |
|---|---|
When ambari is kerberized via its wizard, everything else is
automatically configured except kafka settings for a |
Generic settings for dataflow components to work with an HA setup can be seen below, where the nameservice id is set to mycluster.
spring:
  hadoop:
    fsUri: hdfs://mycluster:8020
    config:
      dfs.ha.automatic-failover.enabled=True
      dfs.nameservices=mycluster
      dfs.client.failover.proxy.provider.mycluster=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
      dfs.ha.namenodes.mycluster=nn1,nn2
      dfs.namenode.rpc-address.mycluster.nn2=ambari-3.localdomain:8020
      dfs.namenode.rpc-address.mycluster.nn1=ambari-2.localdomain:8020

![]() | Note |
|---|---|
When using ambari and Hdfs HA setup, configuration is fully automated. |
By default, a dataflow server will start an embedded H2 database using in-memory storage, effectively using the configuration shown below.
spring:
  datasource:
    url: jdbc:h2:tcp://localhost:19092/mem:dataflow
    username: sa
    password:
    driverClassName: org.h2.Driver

The distribution package contains a bundled, self-contained H2 executable which can be used instead. This allows data to be persisted across server restarts and is not limited to a single host.
./bin/dataflow-server-yarn-h2 --dataflow.database.h2.directory=/var/run/scdf/data
spring:
  datasource:
    url: jdbc:h2:tcp://neo:19092/dataflow
    username: sa
    password:
    driverClassName: org.h2.Driver

![]() | Important |
|---|---|
With external H2 instance you cannot use |
![]() | Note |
|---|---|
Port can be changed using property |
This bundled H2 database is also used in ambari to provide default out of the box functionality. Any database supported by dataflow itself can be used by changing the datasource settings.
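For example, switching to an external MySQL instance is a matter of changing the standard Spring Boot datasource properties in servers.yml; the host, database name, credentials and driver below are assumptions, and the chosen JDBC driver must be on the classpath:

spring:
  datasource:
    url: jdbc:mysql://mysqldb:3306/dataflow
    username: scdf
    password: secret
    driverClassName: org.mariadb.jdbc.Driver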
The YARN Deployer has to be able to talk with the Application Master, which is then responsible for controlling the containers running stream and task applications. The way this works is that the Application Master tries to discover its own address, which the YARN Deployer is then able to use. If YARN cluster nodes have multiple NICs, or the address is discovered wrongly for some other reason, some settings can be changed to alter the default discovery logic.
Below are the generic settings that can be changed.
spring:
  yarn:
    hostdiscovery:
      pointToPoint: false
      loopback: false
      preferInterface: ['eth', 'en']
      matchIpv4: 192.168.0.0/24
      matchInterface: eth\\d*