IBM Installation Manager: The Following Repositories Are Not Connected
To generate an encrypted password, use the grub2-mkpasswd-pbkdf2 command, enter the password you want to use, and copy the command's output (the hash starting with grub.pbkdf2) into the Kickstart file. An example bootloader Kickstart entry with an encrypted password looks similar to the following: bootloader --iscrypted --password=grub.pbkdf2.sha512.C6C9832F3AC3D149AC0B24BE69E2D4FB0DBEEDBD29CA1D30A044DE2645C4C7A291E585D4DC43F8A4D82479F8B95CA4BA43B75E8E0BB2938990.C688B6F0EF935701FF9BD1A8EC7FE5BD233370C5CC8F1A2A233DE22C83705BB614EA17F3FDFDF4AC2161CEA3384E56EB38A2EC47405E.
Multipath devices that use LVM are not assembled until after Anaconda has parsed the Kickstart file. Therefore, you cannot specify these devices in the format dm-uuid-mpath. Instead, to specify a multipath device that uses LVM, use the format disk/by-id/scsi-WWID, where WWID is the world-wide identifier for the device.
For example, to specify a disk with WWID 58095BEC5510947BE8C0318, use:

part / --fstype=xfs --grow --asprimary --size=8192 --ondisk=disk/by-id/scsi-58095BEC5510947BE8C0318

A software RAID layout can be described in the same way:

part raid.01 --size=6000 --ondisk=sda
part raid.02 --size=6000 --ondisk=sdb
part raid.03 --size=6000 --ondisk=sdc
part swap --size=512 --ondisk=sda
part swap --size=512 --ondisk=sdb
part swap --size=512 --ondisk=sdc
part raid.11 --size=1 --grow --ondisk=sda
part raid.12 --size=1 --grow --ondisk=sdb
part raid.13 --size=1 --grow --ondisk=sdc
raid / --level=1 --device=rhel7-root --label=rhel7-root raid.01 raid.02 raid.03
raid /home --level=5 --device=rhel7-home --label=rhel7-home raid.11 raid.12 raid.13

The user file-creation mask can be controlled with the umask command. The default setting of the user file-creation mask for new users is defined by the UMASK variable in the /etc/login.defs configuration file on the installed system. If unset, it defaults to 022, which means that by default, when an application creates a file, it cannot grant write permission to users other than the file's owner. For example, with a umask of 022, a file created with a requested mode of 666 ends up with mode 644.
However, this can be overridden by other settings or scripts. More information can be found in the relevant documentation.
In this article we will install WAS 8 using the IBM Installation Manager (IIM) tool; follow the step-by-step process below. If you choose this option, IIM will contact the IBM online repositories and scan for IIM updates. Click Next to continue. If IIM is not already running, launch IIM.
The *-comps-variant.architecture.xml file contains a structure describing available environments (marked by the <environment> tag) and groups (the <group> tag). Each entry has an ID, user visibility value, name, description, and package list. If the group is selected for installation, the packages marked mandatory in the package list are always installed, the packages marked default are installed if they are not specifically excluded elsewhere, and the packages marked optional must be specifically included elsewhere even when the group is selected.
Initial Setup does not run after a system is installed from a Kickstart file unless a desktop environment and the X Window System were included in the installation and graphical login was enabled. This means that by default, no users except root are created. You can either create a user with the user option in the Kickstart file before installing additional systems from it, or log into the installed system with a virtual console as root and add users with the useradd command. The %pre script can be used for activation and configuration of networking and storage devices. It is also possible to run scripts using interpreters available in the installation environment. Adding a %pre script can be useful if you have networking and storage that needs special configuration before proceeding with the installation, or if you have a script that, for example, sets up additional logging parameters or environment variables.
Debugging problems with %pre scripts can be difficult, so it is recommended to use a %pre script only when necessary. Starting with Red Hat Enterprise Linux 7, Kickstart installations can contain custom scripts which are run when the installer encounters a fatal error - for example, an error in a package that has been requested for installation, failure to start VNC when specified, or an error when scanning storage devices. Installation cannot continue after such an error has occurred. The installer will run all %onerror scripts in the order they are provided in the Kickstart file. In addition, %onerror scripts will be run in the event of a traceback.
'DON'T PANIC' - The Hitchhiker's Guide to the Galaxy, Douglas Adams
'Great, kid. Don't get cocky.' - Han Solo, Star Wars
The WebSphere Application Server Performance Cookbook covers performance tuning for WebSphere Application Server (WAS), although there is also a very strong focus on underlying layers such as Java and the operating system, which can be applied to other products and environments. The cookbook is designed to be read in a few different ways:
• On the go: Readers short on time should skip to the recipes chapter at the end of the book. In the spirit of a cookbook, there are recipes that provide step-by-step instructions on how to gather and analyze data for particular classes of problems.
• General areas: For readers interested in tuning some general area such as Java or the operating system, each major chapter provides a recipe at the top that summarizes the key tuning knobs that should be investigated.
• Deep dive: Readers interested in end-to-end tuning are encouraged to skim the entire book for areas relevant to their product usage.
In general, this book is not intended to be read end-to-end. A large portion of the cookbook is more of a reference book. The nature of performance tuning is that 80% of the time, you should focus on a few key things, but 20% of the time you may need to deep dive into a very specific component.
The high-level topics covered in depth include the operating system, Java, WAS, the application, and more. Note: Before using this information, read the section covering terms of use, statement of support, trademarks, etc. A public developerWorks forum exists for feedback and discussion. Copyright International Business Machines Corporation 2017. All rights reserved. Government Users Restricted Rights: Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corporation. A 2008 survey of 160 organizations (average revenue $1.3 billion) found that the typical impact of a one second delay in response times for web applications was a potential annual revenue loss of $117 million, 11% fewer page views, 7% fewer conversions, 16% lower customer satisfaction, brand damage, more support calls, and increased costs (Customers are Won or Lost in One Second, Aberdeen Group, 2008).
Other benefits include reduced hardware needs and reduced costs, reduced maintenance, reduced power consumption, knowing your breaking points, accurate system sizing, etc. 'Increased performance can often involve sacrificing a certain level of feature or function in the application or the application server. The tradeoff between performance and feature must be weighed carefully when evaluating performance tuning changes.'
A typical performance exercise can yield a throughput improvement of about 200% relative to default tuning parameters. In general, the goal of performance tuning is to increase throughput, reduce response times, and increase the capacity for concurrent requests, all balanced against costs.
• A response time is the time taken to complete a unit of work, for example, the time taken to complete an HTTP response.
• Concurrent requests are the number of requests being processed at the same time, for example, the number of HTTP requests concurrently being processed. A single user may send multiple concurrent requests.
• Throughput is the number of successful responses over a period of time, for example, successful HTTP responses per second. Throughput is proportional to the number of concurrent requests and inversely proportional to response times. When throughput saturates, response times will increase.
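To make the relationship concrete, here is a small illustrative calculation (the numbers are hypothetical, not taken from any benchmark):

public class LittlesLawExample {
    public static void main(String[] args) {
        // For a stable system, concurrency = throughput x response time,
        // so throughput = concurrency / response time (Little's Law).
        double concurrentRequests = 50.0;    // requests in flight at once
        double responseTimeSeconds = 0.100;  // 100 ms average response time
        double throughputPerSecond = concurrentRequests / responseTimeSeconds;
        System.out.printf("Sustainable throughput: %.0f responses/second%n",
                throughputPerSecond); // prints 500
    }
}

In other words, with 50 requests in flight and a 100 ms average response time, the system cannot sustain more than about 500 successful responses per second.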
In the heavy load zone (Section B of the typical throughput-versus-load curve), as the concurrent client load increases, throughput remains relatively constant. However, the response time increases proportionally to the user load. That is, if the user load is doubled in the heavy load zone, the response time doubles. At some point, represented by Section C, the buckle zone, one of the system components becomes exhausted. At this point, throughput starts to degrade.
For example, the system might enter the buckle zone when the network connections at the web server exhaust the limits of the network adapter or when requests exceed operating system limits for file handles. • A hypothesis is a testable idea; it is not yet believed to be either true or false. • A theory is the result of testing a hypothesis and getting a positive result; it is believed to be true. Consider the following methods to eliminate a bottleneck: • Reduce the demand • Increase resources • Improve workload distribution • Reduce synchronization. Reducing the demand for resources can be accomplished in several ways. Caching can greatly reduce the use of system resources by returning a previously cached response, thereby avoiding the work needed to construct the original response.
Caching is supported at several points in the following systems: • IBM HTTP Server • Command • Enterprise bean • Operating system Application code profiling can lead to a reduction in the CPU demand by pointing out hot spots you can optimize. IBM Rational and other companies have tools to perform code profiling. An analysis of the application might reveal areas where some work might be reduced for some types of transactions. Change tuning parameters to increase some resources, for example, the number of file handles, while other resources might need a hardware change, for example, more or faster CPUs, or additional application servers.
Key tuning parameters are described for each major WebSphere Application Server component to facilitate solving performance problems. Also, the performance advisors page can provide advice on tuning a production system under a real or simulated load. Workload distribution can affect performance when some resources are underutilized and others are overloaded.
WebSphere Application Server workload management functions provide several ways to determine how the work is distributed. Workload distribution applies to both a single server and configurations with multiple servers and nodes. Some critical sections of the application and server code require synchronization to prevent multiple threads from running this code simultaneously and producing incorrect results. Synchronization preserves correctness, but it can also reduce throughput when several threads must wait for one thread to exit the critical section. When several threads are waiting to enter a critical section, a thread dump shows these threads waiting in the same procedure. Synchronization can often be reduced by changing the code to only use synchronization when necessary, reducing the path length of the synchronized code, or reducing the frequency of invoking the synchronized code (a minimal sketch follows the cluster questions below). It is always important to consider what happens when some part of a cluster crashes. Will the rest of the cluster handle it gracefully? Does the heap size have enough headroom? Is there enough CPU to handle the extra load? If there is more traffic than the cluster can handle, will it queue and time out gracefully?
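As an illustration of narrowing a critical section, compare the two methods below; the class and its fields are hypothetical, not taken from WAS or any particular application:

import java.util.ArrayList;
import java.util.List;

public class RequestAuditor {
    private final List<String> audit = new ArrayList<>();

    // Coarse: the formatting work runs while the lock is held, so every
    // thread serializes on buildRecord() as well as on the list update.
    public synchronized void recordCoarse(String user, String action) {
        audit.add(buildRecord(user, action));
    }

    // Narrow: only the shared mutable state (the list) is protected; the
    // formatting work runs outside the critical section.
    public void recordNarrow(String user, String action) {
        String record = buildRecord(user, action); // no lock held here
        synchronized (audit) {
            audit.add(record);
        }
    }

    private String buildRecord(String user, String action) {
        // Stand-in for relatively expensive, thread-safe work.
        return System.currentTimeMillis() + " " + user + " " + action;
    }
}

Under contention, a thread dump of the coarse version shows many threads waiting to enter recordCoarse, which is exactly the pattern described above.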
Begin by understanding that one cannot solve all problems immediately. We recommend prioritizing work into short term (high priority), medium term of about 3 months (medium priority), and long term (low priority). How the work is prioritized depends on the business requirements and where the most pain is being felt. Guide yourself primarily with tools and methodologies.
Gather data, analyze it, create hypotheses, and test your hypotheses. Rinse and repeat. In general, we advocate a bottom-up approach. For example, with a typical WebSphere Application Server application, start with the operating system, then Java, then WAS, then the application, etc.
The following are some example scenarios and approaches. They are specific to particular products and symptoms; we just want to highlight them to give you a taste of how to do performance tuning. Later chapters will go through the details. Suggested solution: Utilize request metrics to determine how much each component is contributing to the overall response time. Focus on the component accounting for the most time. Use the Tivoli Performance Viewer to check for resource consumption, including the frequency of garbage collections. You might need code profiling tools to isolate the problem to a specific method.
Suggested solution: Check to determine if any systems have high CPU, network or disk utilization and address those. For clustered configurations, check for uneven loading across cluster members.
Suggested solutions: • Check that work is reaching the system under test. Ensure that some external device does not limit the amount of work reaching the system. Tivoli Performance Viewer helps determine the number of requests in the system.
• A thread dump might reveal a bottleneck at a synchronized method or a large number of threads waiting for a resource. • Make sure that enough threads are available to process the work in IBM HTTP Server, the database, and the application servers. Conversely, too many threads can increase resource contention and reduce throughput.
• Monitor garbage collections with Tivoli Performance Viewer or the verbosegc option of your Java virtual machine. Excessive garbage collection can limit throughput.
If you need assistance, IBM Software Services for WebSphere (ISSW) provides professional consultants to help.
1. Methodically capture data and logs for each test and record results in a spreadsheet. In general, it is best to change one variable at a time. Example test matrix entry: Test #1; start time 1/1/2014 14:00 GMT; ramped up by 14:30 GMT; end time 16:00 GMT; 10 concurrent users; average throughput 50 responses per second; average response time 100 ms; average WAS CPU 25%; average database CPU 25%.
2. Use a flow chart that everyone agrees to. Otherwise, alpha personalities or haphazard and random testing are likely to prevail, and these are less likely to succeed. The following is just an example. Depth first means first 'fill in' application server JVMs within a node before scaling across multiple nodes.
The following are example hypotheses that are covered in more detail in each product chapter.
They are summarized here just for illustration of hypotheses:
• CPU is low, so we can increase threads.
• CPU is low, so there is lock contention (gather monitor contention data through a sampling profiler such as IBM WAIT or IBM Java Health Center).
• CPU is high, so we can decrease threads or investigate possible code issues (gather profiling data through a sampling profiler such as IBM WAIT or IBM Java Health Center).
• Garbage collection overhead is high, so we can tune it.
• Connection pool wait times are high, so we can increase the size of the connection pool (if the total number of connections does not exceed the limits in the database).
• Database response times are high (also identified in thread dumps with many threads stuck in SQL calls), so we can investigate the database.
Deeply understand the logical, physical, and network layout of the systems. Create a rough diagram of the relevant components and important details. For example, how are the various systems connected and do they share any resources (potential bottlenecks) such as networks, buses, etc?
Are the operating systems virtualized? It's also useful to understand the processor layout and in particular, the L2/L3 cache (and NUMA) layouts as you may want to 'carve out' processor sets along these boundaries. Most, if not all, benchmarks have a target maximum concurrent user count. This is usually the best place to start when tuning the various queue sizes, thread pools, etc.
Averages should be used instead of spot observations. For important statistics such as throughput, getting standard deviations would be ideal.
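A small sketch of summarizing repeated measurements with an average and a sample standard deviation; the sample values are made up:

public class RunStatistics {
    public static void main(String[] args) {
        // Throughput samples (responses/second) from repeated test runs.
        double[] samples = {498, 512, 505, 490, 507};

        double sum = 0;
        for (double s : samples) sum += s;
        double mean = sum / samples.length;

        double squaredDiffs = 0;
        for (double s : samples) squaredDiffs += (s - mean) * (s - mean);
        double stdDev = Math.sqrt(squaredDiffs / (samples.length - 1)); // sample standard deviation

        System.out.printf("mean=%.1f stddev=%.1f%n", mean, stdDev);
    }
}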
Each test should have a sufficient 'ramp up' period before data collection starts. Applications may take time to cache certain content and the Java JIT will take time to optimally compile hot methods. Monitor all parts of the end-to-end system.
Consider starting with an extremely simplified application to ensure that the desired throughput can be achieved. Incrementally exercise each component: for example, a Hello World servlet, followed by a servlet that does a simple select from a database, etc. This lets you confirm that end-to-end 'basics' work, including the load testing apparatus. Run a saturation test where everything is pushed to the maximum (may be difficult due to lack of test data or test machines). Make sure things don't crash or break. It's common wisdom that one should always change one variable at a time when investigating problems, performance testing, etc.
The idea is that if you change more than one variable at a time, and the problem goes away, then you don't know which one solved it. For example, let's say one changes the garbage collection policy, the maximum heap size, and some of the application code, and performance improves; then one doesn't know which change helped. The premise underlying this wisdom is that all variables are independent, which is sometimes (maybe usually, to different degrees) not the case. In the example above, the garbage collection policy and maximum heap size are intimately related. For example, if you change the GC policy to gencon but don't increase the maximum heap size, it may not be a fair comparison to a non-gencon GC policy, because the design of gencon means that some proportion of the heap is no longer available relative to non-gencon policies (due to the survivor space in the nursery, based on the tilt ratio). What's even more complicated is that it's often difficult to reason about variable independence.
For example, most variables have indirect effects on processor usage or other shared resources, and these can have subtle effects on other variables. The best example is when removing a bottleneck at one tier overloads another tier and indirectly affects the first tier (or exercises a new, worse bottleneck). So what should one do?
To start, accept that changing one variable at a time is not always correct; however, it's often a good starting point. Unless there's a reason to believe that changing multiple, dependent variables makes sense (for example, comparing gencon to non-gencon GC policies), then it's fair to assume initially that, even if variables may not be truly independent, the impact of one variable commonly drowns out other variables. Just remember that ideally you would test all combinations of the variables.
Unfortunately, as the number of variables increases, the number of tests increases exponentially. Specifically, for N variables, there are (2^N - 1) combinations. For example, for two variables A and B, you would test A by itself, B by itself, and then A and B together (2^2 - 1 = 3). However, just adding two more variables to make four in total brings it up to 15 different tests.
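To make that growth rate concrete, the following sketch simply prints the number of non-empty variable combinations for increasing N (purely illustrative):

public class CombinationCount {
    public static void main(String[] args) {
        // For N tuning variables, testing every non-empty combination
        // requires 2^N - 1 test runs.
        for (int n = 1; n <= 6; n++) {
            long combinations = (1L << n) - 1;
            System.out.println(n + " variables -> " + combinations + " tests");
        }
        // 1 -> 1, 2 -> 3, 3 -> 7, 4 -> 15, 5 -> 31, 6 -> 63
    }
}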
There are three reasons to consider this question: First, it's an oversimplification to think that one should always change one variable at a time, and it's important to keep in the back of one's head that if changing one variable at a time doesn't work, then changing multiple variables at a time might (of course, they might also just be wrong or inconsequential variables). Second, particularly for performance testing, even if changing a single variable improves performance, it's possible that changing some combination of variables will improve performance even more. Which is to say that changing a single variable at a time is non-exhaustive. Finally, it's not unreasonable to try the alternative, scattershot approach first of changing all relevant variables at the same time, and if there are benefits, removing variables until the key ones are isolated. This is more risky because there could be one variable that makes an improvement and another that cancels that improvement out, and one may conclude too much from this test.
However, one can also get lucky by observing some interesting behavior from the results and then deducing what the important variable(s) are. This is sometimes helpful when one doesn't have much time and is feeling lucky (or has some gut feelings to support this approach). So what's the answer to the question, 'Is changing one variable at a time always correct?'
No, it's not always correct. Moreover, it's not even optimal, because it's non-exhaustive. But it usually works. When a naval ship declares 'battle stations,' there is an operations manual that every sailor on the ship is familiar with: each sailor knows where to go and what to do. Much like a navy crew, when a problem occurs that negatively affects the runtime environment, it is helpful for everyone to know where they need to be and who does what. Each issue that occurs is an educational experience. Effective organizations have someone on the team taking notes.
This way when history repeats itself the team can react more efficiently. Even if a problem does not reappear the recorded knowledge will live on. Organizations are not static. People move on to new projects and roles.
The newly incoming operations team members can inherit the documentation to see how previous problems were solved. For each problem we want to keep a record of the following points:
• Symptom(s) of the problem - brief title
• More detailed summary of the problem
• Who reported the problem?
• What exactly is the problem?
• Summary of all the people who were involved in troubleshooting and what their roles were. The role is important because it will help a new team understand what roles need to exist.
• Details of the troubleshooting:
  • What data was collected?
  • Who looked at the data?
  • The result of their analysis
  • What recommendations were made
  • Did the recommendations work (i.e. fix the problem)?
Applications are typically designed with specific end user scenarios documented as use cases (for example, see the book Writing Effective Use Cases by Alistair Cockburn). Use cases drive the test cases that are created for load testing.
A common perception in IT is that performance testing can be accommodated by what is known as the 80/20 rule: we will test the 80% of actions the users do most often and ignore the 20% they do not do as frequently. However, what this does not address is the 20% that can induce a negative performance event, causing serious performance degradation for the other 80%. Performance testing should always test 100% of the documented use cases. The 80/20 rule also applies to how far you should tune. You can increase performance by disabling things such as performance metrics (PMI) and logging, but this may sacrifice serviceability and maintenance. Unless you're actually benchmarking for top speed, we do not recommend applying such tuning.
Begin by choosing a benchmark, a standard set of operations to run. This benchmark exercises those application functions experiencing performance problems. Complex systems frequently need a warm-up period to cache objects, optimize code paths, and so on. System performance during the warm-up period is usually much slower than after the warm-up period. The benchmark must be able to generate work that warms up the system prior to recording the measurements that are used for performance analysis.
Depending on the system complexity, a warm-up period can range from a few thousand transactions to longer than 30 minutes. Another key requirement is that the benchmark must be able to produce repeatable results.
If the results vary more than a few percent from one run to another, consider the possibility that the initial state of the system might not be the same for each run, or the measurements are made during the warm-up period, or that the system is running additional workloads. Several tools facilitate benchmark development. The tools range from tools that simply invoke a URL to script-based products that can interact with dynamic data generated by the application. IBM® Rational® has tools that can generate complex interactions with the system under test and simulate thousands of users. Producing a useful benchmark requires effort and needs to be part of the development process. Do not wait until an application goes into production to determine how to measure performance.
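The following skeleton illustrates the warm-up-then-measure pattern described above; the URL, the phase durations, and the single-threaded loop are placeholder assumptions, not a substitute for a real load-testing tool:

import java.net.HttpURLConnection;
import java.net.URL;

public class SimpleBenchmark {
    public static void main(String[] args) throws Exception {
        URL target = new URL("http://localhost:9080/hello"); // hypothetical endpoint

        runPhase(target, 60_000, false);                     // warm-up: results discarded
        double avgMillis = runPhase(target, 300_000, true);  // measurement phase
        System.out.printf("Average response time: %.1f ms%n", avgMillis);
    }

    private static double runPhase(URL target, long durationMillis, boolean record)
            throws Exception {
        long end = System.currentTimeMillis() + durationMillis;
        long totalNanos = 0, requests = 0;
        while (System.currentTimeMillis() < end) {
            long start = System.nanoTime();
            HttpURLConnection conn = (HttpURLConnection) target.openConnection();
            conn.getResponseCode(); // issue the request and wait for the response
            conn.disconnect();
            if (record) {
                totalNanos += System.nanoTime() - start;
                requests++;
            }
        }
        return (record && requests > 0) ? (totalNanos / 1_000_000.0) / requests : 0.0;
    }
}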
The benchmark records throughput and response time results in a form to allow graphing and other analysis techniques. Reset as many variables as possible on each test. This is most important for tests involving databases, which tend to accumulate data and can negatively impact performance.
If possible, data should be truncated & reloaded on each test. Determine the level of performance that will be considered acceptable to the customer at the outset of the engagement.
Ensure that the objective is clear, measurable, and achievable. Examples of this are: • To show that the proposed WESB application can mediate complex messages 10 KB in size at a rate of 1,000 per second or more, on an 8 processor core pSeries Power6 system. • To show that a WPS workflow application can handle 600 simultaneous users, where each human user’s average think time is 5 minutes and the response time is under 5 seconds, on an 8 processor core pSeries Power6 system. As a counter-example, here is a poorly defined objective: 'WPS performance must not be the bottleneck in the overall throughput of the solution, and must be able to handle peak load.'
Problems with this objective include: how do you prove that WPS is not the bottleneck? What is the definition of peak load? What are the attributes of the solution? What hardware will be used? There are various commercial products such as Rational Performance Tester. If such a tool is not available, there are various open source alternatives such as Apache Bench, Apache JMeter, Siege, and OpenSTA. The Apache JMeter tool is covered in more detail later in the cookbook and is a generally recommended tool.
Apache Bench is a binary distributed in the 'bin' folder of the httpd package (and therefore with IBM HTTP Server as well). It can do very simple benchmarking of a single URL, specifying the total number of requests (-n) and the concurrency at which to send the requests (-c); for example, ab -n 1000 -c 10 http://host/page sends 1,000 requests, 10 at a time. Additionally, see the chapter for your particular operating system. A processor is an integrated circuit (also known as a socket or die) with one or more central processing unit (CPU) cores. A CPU core executes program instructions such as arithmetic, logic, and input/output operations. CPU utilization is the percentage of time that programs or the operating system execute as opposed to idle time. A CPU core may support simultaneous multithreading (also known as hardware threads or hyperthreads), which appears to the operating system as additional logical CPU cores. Be aware that simple CPU utilization numbers may be unintuitive in the context of advanced processor features:
The current implementation of [CPU utilization] shows the portion of time slots that the CPU scheduler in the OS could assign to execution of running programs or the OS itself; the rest of the time is idle. The advances in computer architecture made this algorithm an unreliable metric because of the introduction of multi-core and multi-CPU systems, multi-level caches, non-uniform memory, simultaneous multithreading (SMT), pipelining, out-of-order execution, etc.
A prominent example is the non-linear CPU utilization on processors with Intel® Hyper-Threading Technology (Intel® HT Technology). Intel® HT Technology is a great performance feature that can boost performance by up to 30%. However, HT-unaware end users get easily confused by the reported CPU utilization: consider an application that runs a single thread on each physical core. Then, the reported CPU utilization is 50% even though the application can use up to 70%-100% of the execution units. Use care when partitioning [CPU cores]. It's important to recognize that [CPU core] partitioning doesn't create more resources, it simply enables you to divide and allocate the [CPU core] capacity. At the end of the day, there still needs to be adequate underlying physical CPU capacity to meet response time and throughput requirements when partitioning [CPU cores].
Otherwise, poor performance will result. It is not necessarily problematic for a machine to have many more program threads than processor cores. This is common with Java and WAS processes that come with many different threads and thread pools by default that may not be used often.
Even if the main application thread pool (or the sum of these across processes) exceeds the number of processor cores, this is only concerning if the average unit of work uses the processor heavily. For example, if threads are mostly I/O bound to a database, then it may not be a problem to have many more threads than cores. There are potential costs to threads even if they are usually sleeping, but these may be acceptable. The danger is when the concurrent workload on available threads exceeds processor capacity.
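A hedged sketch of sizing a worker pool relative to the cores the JVM can see; the multiplier for I/O-bound work is an assumption for illustration only and should be validated under load:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();

        // CPU-bound work: roughly one thread per core keeps the processors
        // busy without excessive context switching.
        ExecutorService cpuBoundPool = Executors.newFixedThreadPool(cores);

        // I/O-bound work (e.g. threads mostly waiting on a database): more
        // threads than cores can be acceptable because most are sleeping.
        // The factor of 4 is illustrative, not a recommendation.
        ExecutorService ioBoundPool = Executors.newFixedThreadPool(cores * 4);

        System.out.println("cores=" + cores);
        cpuBoundPool.shutdown();
        ioBoundPool.shutdown();
    }
}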
There are cases where thread pools are excessively large but there has not been a condition where they have all filled up (whether due to workload or a front-end bottleneck). It is very important that stress tests saturate all commonly used thread pools to observe worst-case behavior. Depending on the environment, the number of processes, redundancy, and continuous availability and/or high availability requirements, the threshold for %CPU utilization varies.
For high availability and continuous availability environments, the threshold can be as low as 50% CPU utilization. For non-critical applications, the threshold could be as high as 95%. Analyze both the non-functional requirements and service level agreements of the application in order to determine appropriate thresholds to indicate a potential health issue. It is common for some modern processors (including server class) and operating systems to enable processor scaling by default.
The purpose of processor scaling is primarily to reduce power consumption. Processor scaling dynamically changes the frequency of the processor(s), and therefore may impact performance. In general, processor scaling should not kick in during periods of high use; however, it does introduce an extra performance variable.
Weigh the energy saving benefits versus disabling processor scaling and simply running the processors at maximum speed at all times (usually done in the BIOS). Test affinitizing processes to processor sets (operating system specific configuration). In general, affinitize within processor boundaries. Also, start each JVM with -XgcthreadsN (IBM Java) or -XX:ParallelGCThreads=N (Oracle/HotSpot Java) where N equals the number of processor core threads in the processor set.
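A quick way to check how many logical CPUs the JVM actually sees (the basis for choosing N above) along with the one-minute system load average; note that whether availableProcessors() reflects an affinity mask or container limit depends on the operating system and JVM version:

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class CpuVisibility {
    public static void main(String[] args) {
        int logicalCpus = Runtime.getRuntime().availableProcessors();
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();

        System.out.println("Logical CPUs visible to the JVM: " + logicalCpus);
        // getSystemLoadAverage() returns -1.0 on platforms that do not
        // support it (for example, Windows).
        System.out.println("1-minute load average: " + os.getSystemLoadAverage());
    }
}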
It is sometimes worth understanding the physical architecture of the central processing units (CPUs). Clock speed and the number of cores/hardware threads are the most obviously important metrics, but CPU memory locality, bus speeds, and L2/L3 cache sizes are sometimes worth considering. One strategy for deciding on the number of JVMs is to create one JVM per processor chip (i.e. socket) and bind it to that chip.
It's common for operating systems to dedicate some subset of CPU cores for interrupt processing and this may distort other workloads running on those cores. Different types of CPU issues (Old Java Diagnostic Guide): • Inefficient or looping code is running. A specific thread or a group of threads is taking all the CPU time. • Points of contention or delay exist.
CPU usage is spread across most threads, but overall CPU usage is low. • A deadlock is present. No CPU is being used. As a starting point, I plan on having at least one CPU [core] per application server JVM; that way I have likely minimized the number of times that a context switch will occur -- at least as far as using up a time slice is concerned (although, as mentioned, there are other factors that can result in a context switch). Unless you run all your servers at 100% CPU, more than likely there are CPU cycles available as application requests arrive at an application server, which in turn are translated into requests for operating system resources.
Therefore, we can probably run more application servers than CPUs. Arriving at the precise number that you can run in your environment, however, brings us back to it depends. This is because that number will in fact depend on the load, application, throughput, and response time requirements, and so on, and the only way to determine a precise number is to run tests in your environment. In general one should tune a single instance of an application server for throughput and performance, then incrementally add [processes] testing performance and throughput as each [process] is added. By proceeding in this manner one can determine what number of [processes] provide the optimal throughput and performance for their environment. In general once CPU utilization reaches 75% little, if any, improvement in throughput will be realized by adding additional [processes].
Random access memory (RAM) is a high speed, ephemeral data storage circuit located near CPU cores. RAM is often referred to as physical memory to contrast it to virtual memory. Physical memory comprises the physical storage units which support memory usage in a computer (apart from CPU core memory registers), whereas virtual memory is a logical feature that an operating system provides for isolating and simplifying access to physical memory. Strictly speaking, physical memory and RAM are not synonymous because physical memory includes paging space, and paging space is not RAM. Paging space is a subset of physical memory, often disk storage or a solid state drive (SSD), which the operating system uses as a 'spillover' when demands for physical memory exceed available RAM. Historically, swapping referred to paging in or out an entire process; however, many use paging and swapping interchangeably today, and both address page-sized units of memory (e.g. 4 KB). Overcommitting memory occurs when less RAM is available than the peak in-use memory demand.
This is either done accidentally (undersizing) or consciously, on the premise that it is unlikely that all required memory will be accessed at once. Overcommitting is dangerous because the process of paging in and out may be time consuming. RAM operates at 10s of GB/s, whereas even the fastest SSDs operate at a maximum of a few GB/s (often the bottleneck is the interface to the SSD). Overcommitting memory is particularly dangerous with Java because some types of garbage collections will need to read most of the whole virtual address space for a process in a short period of time.
When paging is very heavy, this is called memory thrashing, and usually this will result in a total performance degradation of the system by multiple orders of magnitude. Some people recommend sizing the paging files to some multiple of RAM; however, this recommendation is a rule of thumb that may not be applicable to many workloads. Some people argue that paging is worse than crashing because a system can enter a zombie-like state and the effect can last hours before an administrator is alerted and investigates the issue. Investigation itself may be difficult because connecting to the system may be slow or impossible while it is thrashing. Therefore, some decide to dramatically reduce paging space (e.g.
10 MB) or remove the paging space completely which will force the operating system to crash processes that are using too much memory. This creates clear and immediate symptoms and allows the system to potentially restart the processes and recover. A tiny paging space is probably preferable to no paging space in case the operating system decides to do some benign paging. A tiny paging space can also be monitored as a symptom of problems. Some workloads may benefit from a decently sized paging space. For example, infrequently used pages may be paged out to make room for filecache, etc.
'Although most do it, basing page file size as a function of RAM makes no sense because the more memory you have, the less likely you are to need to page data out.' (Russinovich & Solomon) Non-uniform Memory Access (NUMA) is a design in which RAM is partitioned so that subsets of RAM (called NUMA nodes) are 'local' to certain processors. Consider affinitizing processes to particular NUMA nodes. Whether 32-bit or 64-bit will be faster depends on the application, workload, physical hardware, and other variables. All else being equal, in general, 32-bit will be faster than 64-bit because 64-bit doubles the pointer size, therefore creating more memory pressure (lower CPU cache hits, TLB, etc.).
However, all things are rarely equal. For example, 64-bit often provides more CPU registers than 32-bit (this is not always the case, such as Power), and in some cases, the benefits of more registers outweigh the memory pressure costs. There are other cases such as some mathematical operations where 64-bit will be faster due to instruction availability (and this may apply with some TLS usage, not just obscure mathematical applications).
Java significantly reduces the impact of the larger 64-bit pointers within the Java heap by using compressed references. With all of that said, in general, the industry is moving towards 64-bit and the performance difference for most applications is in the 5% range. Several platforms support using memory pages that are larger than the default memory page size. Depending on the platform, large memory page sizes can range from 4 MB (Windows) to 16 MB (AIX), and up to 1 GB, versus the default page size of 4 KB. Many applications (including Java-based applications) often benefit from large pages due to a reduction in the CPU overhead of managing a smaller number of larger pages. Large pages may cause a small throughput improvement.
In one benchmark, the improvement was about 2%. Some recent benchmarks on very modern hardware have found little benefit to large pages, although no negative consequences, so they're still a best practice in most cases. Many problems are caused by exhausted disk space.
It is critical that disk space is monitored and alerts are created when usage is very high. Disk speed may be an important factor in some types of workloads. Some operating systems support mounting physical memory as disk partitions (sometimes called RAMdisks), allowing you to target certain disk operations that have recreatable contents to physical memory instead of slower disks. Ensure that NICs and switches are configured to use their top speeds and full duplex mode. Sometimes this needs to be explicitly done, so you should not assume that this is the case by default. In fact, it has been observed that when the NIC is configured for auto-negotiate, sometimes the NIC and the switch can auto-negotiate very slow speeds and half duplex. This is why setting explicit values is recommended.
If the network components support jumbo frames, consider enabling them across the relevant parts of the network. Check network performance between two hosts. For example, make a 1 GB file (using operating system commands like dd or mkfile).
Then test the network throughput by copying it using FTP, SCP, etc. Monitor ping latency between hosts, particularly any periodic large deviations. It is common to have separate NICs for incoming traffic (e.g. HTTP requests) and for backend traffic (e.g. database connections). In some cases, and particularly on some operating systems, this setup may perform worse than a single NIC (as long as it doesn't saturate), probably due to interrupts and L2/L3 cache utilization side-effects. TCP/IP is used for most network communications such as HTTP, so understanding and optimizing the operating system TCP/IP stack can have dramatic upstream effects on your application. TCP/IP is normally used in a fully duplexed mode, meaning that communication can occur asynchronously in both directions.
In such a mode, a distinction between 'client' and 'server' is arbitrary and sometimes can confuse investigations (for example, if a web browser is uploading a large HTTP POST body, it is first the 'server' and then becomes the 'client' when accepting the response). You should always think of a set of two sender and receiver channels for each TCP connection. TCP/IP is a connection oriented protocol, unlike UDP, and so it requires handshakes (sets of packets) to start and close connections. The establishing handshake starts with a SYN packet from sender IP address A on an ephemeral local port X to receiver IP address B on a port Y (every TCP connection is uniquely identified by this 4-tuple). If the connection is accepted by B, then B sends back an acknowledgment (ACK) packet as well as its own SYN packet to establish the fully duplexed connection (SYN/ACK).
Finally, A sends a final ACK packet to acknowledge the established connection. This handshake is commonly referred to as SYN, SYN/ACK, ACK. A TCP/IPv4 packet has a 40 byte header (20 for TCP and 20 for IPv4). Network performance debugging (often euphemistically called 'TCP tuning') is extremely difficult because nearly all flaws have exactly the same symptom: reduced performance.
For example, insufficient TCP buffer space is indistinguishable from excess packet loss (silently repaired by TCP retransmissions) because both flaws just slow the application, without any specific identifying symptoms. The amount of data that can be in transit in the network, termed 'Bandwidth-Delay-Product,' or BDP for short, is simply the product of the bottleneck link bandwidth and the Round Trip Time (RTT). In general, the maximum socket receive and send buffer sizes should be greater than the average BDP. TCP/IP flow control allows a sender to send more packets before receiving acknowledgments for previous packets.
Flow control also tries to ensure that a sender does not send data faster than a receiver can handle. The receiver includes a 'window size' in each acknowledgment packet which tells the sender how much buffer room the receiver has for future packets. If the window size is 0, the sender should stop sending packets until it receives a TCP Window Update packet or an internal retry timer fires. If the window size is non-zero, but it is too small, then the sender may spend unnecessary time waiting for acknowledgments. The window sizes are directly affected by the rate at which the application can produce and consume packets (for example, if CPU is 100% then a program may be very slow at producing and consuming packets) as well as operating system TCP sending and receiving buffer size limits. The buffers are chunks of memory allocated and managed by the operating system to support TCP/IP flow control.
It is generally advisable to increase these buffer size limits as much as operating system configuration, physical memory and the network architecture can support. The maximum throughput based on the receiver window is rwnd/RTT. TCP sockets pass through various states such as LISTENING, ESTABLISHED, CLOSED, etc.
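As a worked example of the buffer-sizing guidance above (the link speed and round-trip time are hypothetical):

public class BdpExample {
    public static void main(String[] args) {
        // Bandwidth-Delay Product = bottleneck bandwidth x round-trip time.
        double bandwidthBitsPerSecond = 1_000_000_000.0; // 1 Gbit/s bottleneck link
        double rttSeconds = 0.050;                       // 50 ms round trip

        double bdpBytes = (bandwidthBitsPerSecond / 8.0) * rttSeconds;
        System.out.printf("BDP = %.2f MB%n", bdpBytes / (1024 * 1024));
        // Roughly 6 MB: the socket send and receive buffer limits should be
        // at least this large to keep such a link full.
    }
}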
One particularly misunderstood state is the TIME_WAIT state which can sometimes cause scalability issues. A full duplex close occurs when sender A sends a FIN packet to B to initiate an active close (A enters FIN_WAIT_1 state).
When B receives the FIN, it enters CLOSE_WAIT state and responds with an ACK. When A receives the ACK, A enters FIN_WAIT_2 state. Strictly speaking, B does not have to immediately close its channel (if it wanted to continue sending packets to A); however, in most cases it will initiate its own close by sending a FIN packet to A (B now goes into LAST_ACK state).
When A receives the FIN, it enters TIME_WAIT and sends an ACK to B. The reason for the TIME_WAIT state is that there is no way for A to know that B received the ACK. The TCP specification defines the maximum segment lifetime (MSL) to be 2 minutes (this is the maximum time a packet can wander the net and stay valid). The operating system should ideally wait 2 times MSL to ensure that a retransmitted packet for the FIN/ACK doesn't collide with a newly established socket on the same port (for instance, if the port had been immediately reused without a TIME_WAIT and if other conditions such as total amount transferred on the packet, sequence number wrap, and retransmissions occur). This behavior can cause scalability issues: Because of TIME-WAIT state, a client program should choose a new local port number (i.e., a different connection) for each successive transaction.
However, the TCP port field of 16 bits (less the 'well-known' port space) provides only 64512 available user ports. This limits the total rate of transactions between any pair of hosts to a maximum of 64512/240 = 268 per second. Most operating systems do not use 4 minutes as the default TIME_WAIT duration because of the low probability of the wandering packet problem and other mitigating factors. Nevertheless, if you observe socket failures accompanied by large numbers of sockets in TIME_WAIT state, then you should reduce the TIME_WAIT duration further. Conversely, if you observe very strange behavior when new sockets are created that can't be otherwise explained, you should use 4 minutes as a test to ensure this is not a problem. Finally, it's worth noting that some connections will not follow the FIN/ACK, FIN/ACK procedure, but may instead use FIN, FIN/ACK, ACK, or even just a RST packet (abortive close).
There is a special problem associated with small packets. When TCP is used for the transmission of single-character messages originating at a keyboard, the typical result is that 41 byte packets (one byte of data, 40 bytes of header) are transmitted for each byte of useful data. This 4000% overhead is annoying but tolerable on lightly loaded networks. On heavily loaded networks, however, the congestion resulting from this overhead can result in lost datagrams and retransmissions, as well as excessive propagation time caused by congestion in switching nodes and gateways.
The solution is to inhibit the sending of new TCP segments when new outgoing data arrives from the user if any previously transmitted data on the connection remains unacknowledged. In practice, enabling Nagle's algorithm (which is usually enabled by default) means that TCP will not send a new packet if a previously sent packet is still unacknowledged, unless it has 'enough' coalesced data for a larger packet. The native C setsockopt option to disable Nagle's algorithm is TCP_NODELAY, and this option can usually be set globally at an operating system level.
This option is also exposed in Java's StandardSocketOptions for setting it on a particular Java socket; a minimal sketch appears after the delayed-acknowledgment discussion below. In WebSphere Application Server, TCP_NODELAY is explicitly enabled by default for all WAS TCP channel sockets. TCP delayed acknowledgments were designed in the late 1980s in an environment of baud-speed modems.
Delaying acknowledgments was a tactic used when communication over wide area networks was really slow and the delay allowed acknowledgment packets to piggy-back on responses within a window of a few hundred milliseconds. In modern networks, these added delays may cause significant latencies in network communications. Delayed acknowledgment is a completely separate function from Nagle's algorithm (TCP_NODELAY). Both act to delay packets in certain situations. This can be very subtle; for example, on AIX, the option for the former is tcp_nodelayack and the option for the latter is tcp_nodelay. The delayed ACK mechanism delays acknowledgments up to 500 milliseconds (the common default maximum is 200 milliseconds) from when a packet arrives (but no more than every second segment) to reduce the number of ACK-only packets and ACK chatter, because the ACKs may piggy-back on a response packet.
It may be the case that disabling delayed ACKs, while increasing network chatter and utilization (if an ACK only packet is sent where it used to piggy back a data packet, then there will be an increase in total bytes sent because of the increase in the number of packets and therefore TCP header bytes), may improve throughput and responsiveness. However, there are also cases where delayed ACKs perform better. It is best to test the difference. 'A host that is receiving a stream of TCP data segments can increase efficiency in both the Internet and the hosts by sending fewer than one ACK (acknowledgment) segment per data segment received; this is known as a 'delayed ACK' [TCP:5].
A TCP SHOULD implement a delayed ACK, but an ACK should not be excessively delayed; in particular, the delay MUST be less than 0.5 seconds, and in a stream of full-sized segments there SHOULD be an ACK for at least every second segment. A delayed ACK gives the application an opportunity to update the window and perhaps to send an immediate response. In particular, in the case of character-mode remote login, a delayed ACK can reduce the number of segments sent by the server by a factor of 3 (ACK, window update, and echo character all combined in one segment). In addition, on some large multi-user hosts, a delayed ACK can substantially reduce protocol processing overhead by reducing the total number of packets to be processed [TCP:5]. However, excessive delays on ACK's can disturb the round-trip timing and packet 'clocking' algorithms [TCP:7].' Delayed acknowledgments interact poorly with Nagle's algorithm. For example, if A sent a packet to B, and B is waiting to send an acknowledgment to A until B has some data to send (Delayed Acknowledgments), and if A is waiting for the acknowledgment (Nagle's Algorithm), then a delay is introduced.
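As noted earlier, Java exposes TCP_NODELAY through StandardSocketOptions; here is a minimal sketch of disabling Nagle's algorithm on a single client socket (the host and port are placeholders, and on Java 8 and earlier the equivalent call is setTcpNoDelay(true)):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.StandardSocketOptions;

public class NoDelayClient {
    public static void main(String[] args) throws IOException {
        try (Socket socket = new Socket()) {
            // Disable Nagle's algorithm for this socket only: small writes
            // are sent immediately instead of being coalesced.
            socket.setOption(StandardSocketOptions.TCP_NODELAY, true);
            socket.connect(new InetSocketAddress("example.com", 80)); // hypothetical peer
            socket.getOutputStream().write("ping".getBytes());
        }
    }
}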
In Wireshark, you can look for the 'Time delta from previous packet' entry for the ACK packet to determine the amount of time elapsed waiting for the ACK. Although delayed acknowledgment may adversely affect some applications, it can improve performance for other network connections. The pros of delayed acknowledgments are: • Reduce network chatter • Reduce potential network congestion • Reduce network interrupt processing (CPU). The cons of delayed acknowledgments are: • Potentially reduce response times and throughput. In general, if two hosts are communicating on a LAN and there is sufficient additional network capacity and sufficient additional CPU interrupt processing capacity, then disabling delayed acknowledgments will tend to improve performance and throughput. However, this option is normally set at an operating system level, so if there are any sockets on the box that may go out to a WAN, then their performance and throughput may potentially be affected negatively. Even on a WAN, for 95% of modern internet connections, disabling delayed acknowledgments may prove beneficial. The most important thing to do is to test the change with real-world traffic, and also include tests emulating users with very slow internet connections and very far distances to the customer data center (e.g. second-long ping times) to understand any impact.
The other potential impact of disabling delayed acknowledgments is that there will be more packets which just have the acknowledgment bit set but still have the TCP/IP header (40 or more bytes). This may cause higher network utilization and network CPU interrupts (and thus CPU usage).
These two factors should be monitored before and after the change. 'With the limited information available from cumulative acknowledgments, a TCP sender can only learn about a single lost packet per round trip time. [With a] Selective Acknowledgment (SACK) mechanism, the receiving TCP sends back SACK packets to the sender informing the sender of data that has been received.
The sender can then retransmit only the missing data segments.' The listen backlog is a limited-size queue for each listening socket that holds pending connections that have completed the SYN handshake but that the process has not yet accepted (therefore they are not yet established). This backlog is used as an overflow for sudden spikes of connections. If the listen backlog fills up, any new connection attempts (SYN packets) will be rejected by the operating system (i.e. they'll fail). As with all queues, you should size it just big enough to handle a temporary but sudden spike, but not so large that too much operating system resource is used; keeping it bounded also means that new connection attempts will fail fast when there is a backend problem.
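On the Java side, the accept backlog can be passed when the listening socket is created; a minimal sketch (the port is arbitrary, and 511 here is just a conventional value):

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class BacklogServer {
    public static void main(String[] args) throws IOException {
        // The second constructor argument is the requested listen backlog;
        // the operating system may silently cap it (e.g. somaxconn on Linux).
        try (ServerSocket server = new ServerSocket(8080, 511)) {
            while (true) {
                try (Socket client = server.accept()) {
                    client.getOutputStream().write("ok\n".getBytes());
                }
            }
        }
    }
}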
There is no science to this, but 511 is a common value. RFC 1122 defines a 'keep-alive' mechanism to periodically send packets for idle connections to make sure they're still alive: A 'keep-alive' mechanism periodically probes the other end of a connection when the connection is otherwise idle, even when there is no data to be sent. The TCP specification does not include a keep-alive mechanism because it could: • cause perfectly good connections to break during transient Internet failures; • consume unnecessary bandwidth ('if no one is using the connection, who cares if it is still good?' ); and • cost money for an Internet path that charges for packets. Some TCP implementations, however, have included a keep-alive mechanism.
To confirm that an idle connection is still active, these implementations send a probe segment designed to elicit a response from the peer TCP. By default, keep-alive is disabled unless a socket specifies SO_KEEPALIVE when it is created. The default idle interval must be no less than 2 hours, but can be configured in the operating system. Ensure that Domain Name Servers (DNS) are very responsive. Consider setting high Time To Live (TTL) values for hosts that are unlikely to change. If performance is very important or DNS response times have high variability, consider adding all major DNS lookups to each operating system's local DNS lookup file (e.g. the hosts file). One of the troubleshooting steps for slow response time issues is to sniff the network between all the network elements (e.g.
HTTP server, application server, database, etc.). The most popular tool for sniffing and analyzing network data is Wireshark, which is covered in more detail later in the cookbook. Common errors are frequent retransmission requests (sometimes due to a bug in the switch or bad cabling). We have seen increasing cases of antivirus software leading to significant performance problems.
Companies are more likely to run quite intrusive antivirus even on critical, production machines. The antivirus settings are usually corporate-wide and may be inappropriate or insufficiently tuned for particular applications or workloads. In some cases, even when an antivirus administrator states that antivirus has been 'disabled,' there may still be kernel level modules that are still operational. In some cases, slowdowns are truly difficult to understand; for example, in one case a slowdown occurred because of a network issue communicating with the antivirus hub, but this occurred at a kernel-level driver in fully native code, so it was very difficult even to hypothesize that it was antivirus. You can use operating system level tools and sampling profilers to check for such cases, but they may not always be obvious. Keep a watch out for signs of antivirus and consider running a benchmark comparison with and without antivirus (completely disabled, perhaps even uninstalled).
Another class of products that is somewhat orthogonal is security products which provide integrity, security, and data-scrubbing capabilities for sensitive data. For example, they will hook into the kernel so that any time a file is copied onto a USB key, a prompt will ask whether the information is confidential or not (and if so, perform encryption). This highlights the point that it is important to gather data on which kernel modules are active (e.g. using CPU during the time of the problem). To ensure that all clocks are synchronized on all nodes, use something like the Network Time Protocol (NTP). This helps with correlating diagnostics and it's required for certain functions in products.
Consider setting one standardized time zone for all nodes, regardless of their physical location. Some consider it easier to standardize on the UTC/GMT/Zulu time zone. POSIX, or Portable Operating System Interface for Unix, is the public standard for Unix-like operating systems, including things like APIs, commands, utilities, threading libraries, etc.
It is implemented in part or in full by: AIX, Linux, Solaris, z/OS USS, HP/UX, etc. One simple and very useful indicator of process health and load is its TCP activity. Here is a script that takes a set of ports and summarizes how many TCP sockets are established, opening, and closing for each port; it has been tested on Linux and AIX. Example output:

$ portstats.sh 80 443
PORT ESTABLISHED OPENING CLOSING
80   3   0   0
443  10  0   2
====================================
Total 13 0 2

As environments continue to grow, automation becomes more important.
On POSIX operating systems, SSH keys may be used to automate running commands, gathering logs, etc. A 30 minute investment to configure SSH keys will save countless hours and mistakes.
• Choose one of the machines to be the orchestrator (or use a Linux, Mac, or Windows Cygwin machine).
• Ensure the SSH key directory exists: $ cd ~/.ssh/ — if this directory does not exist: $ mkdir ~/.ssh && chmod 700 ~/.ssh && cd ~/.ssh/
• Generate an SSH key: $ ssh-keygen -t rsa -b 4096 -f ~/.ssh/orchestrator
If using Linux:
• Run the following command for each machine: $ ssh-copy-id -i ~/.ssh/orchestrator user@host
For other POSIX operating systems:
• Log in to each machine as a user that has access to all logs (e.g. root): $ ssh user@host
• Ensure the SSH key directory exists: $ cd ~/.ssh/ — if this directory does not exist: $ mkdir ~/.ssh && chmod 700 ~/.ssh && cd ~/.ssh/
• If the file ~/.ssh/authorized_keys does not exist: $ touch ~/.ssh/authorized_keys && chmod 700 ~/.ssh/authorized_keys
• Append the public key from ~/.ssh/orchestrator.pub to ~/.ssh/authorized_keys.