Kafka Troubleshooting in Production: Stabilizing Kafka Clusters in the Cloud and On-Premises. 1st ed. Berkeley, CA: Apress L. P., 2023. ©2023. 1 online resource (229 pages). Includes index. ISBN 1-4842-9490-4; ISBN 1-4842-9489-0. DOI 10.1007/978-1-4842-9490-1. OCLC 1411849283.

Intro -- Table of Contents -- About the Author -- About the Technical Reviewer -- Acknowledgments -- Introduction -- Chapter 1: Storage Usage in Kafka: Challenges, Strategies, and Best Practices -- How Kafka Runs Out of Disk Space -- A Retention Policy Can Cause Data Loss -- Configuring a Retention Policy for Kafka Topics -- Managing Consumer Lag and Preventing Data Loss -- Handling Bursty Data Influx from Producers -- Balancing Consumer Throttling and Avoiding Unintended Lag -- Understanding Daily Traffic Variations and Their Impact on Data Retention -- Ensuring Batch Duration Compliance with Topic Retention to Avoid Data Loss -- Adding Storage to Kafka Clusters -- When the Cluster Is On-Prem -- Scaling Up -- Scaling Out -- When the Cluster Is in the Cloud -- EBS Disks -- NVME Disks (Ephemeral Disks) -- Strategies and Considerations for Extended Retention in Kafka Clusters -- Calculating Storage Capacity Based on Time-Based Retention -- Retention Monitoring -- Data Skew in Partitions -- Message Rate Into Topic -- Don't Write to the / Mount Point -- Summary -- Chapter 2: Strategies for Aggregation, Data Cardinality, and Batching -- Balancing Message Distribution and Aggregation for Optimal Kafka Performance -- Tuning Parameters to Increase Throughput and Reduce Latency -- Optimizing Producer and Broker Performance: The Impact of Tuning linger.ms and batch.size in Kafka -- Understanding Compression Rate -- The Effect of Data Cardinality on Producers, Consumers, and
Brokers -- Defining Data Cardinality -- Effects of High Data Cardinality -- Reducing Cardinality Level and Distribution -- Duplicating Data to Reduce Latency -- Summary -- Chapter 3: Understanding and Addressing Partition Skew in Kafka -- Skew of Partition Leaders vs. Skew of Partition Followers -- Potential Problems with Brokers That Host Many Partition Leaders -- Message Rate (or Incoming Bytes Rate) -- Number of Consumers Consuming the Topic -- Number of Producers Producing to the Topic -- Follower (Replica) Skew in the Broker -- Number of In-Sync Replicas in the Broker -- Checking for an Imbalance of Partition Leaders -- Reassigning Partitions to Achieve an Even Distribution -- Data Distribution Among Disks -- Summary -- Chapter 4: Dealing with Skewed and Lost Leaders -- When Partitions Lose Their Leadership -- ZooKeeper -- The Network Interface Card (NIC) -- Should Leader Skew Always Be Solved? -- When There Is High Traffic -- When There Is a Large Number of Consumers/Producers -- Understanding Leader Skew -- Summary -- Chapter 5: CPU Saturation in Kafka: Causes, Consequences, and Solutions -- CPU Saturation -- CPU Usage Types -- Causes of High CPU User Times -- Causes of High CPU System Times -- Example of Kafka Brokers with High CPU %us and %sy -- Causes of High CPU Wait Times -- Causes of High CPU System Interrupt Times -- The Effect of Compacted Topics with High Retention on Disk and CPU Use -- What Is Log Compaction? -- Real Production Issues Due to Log Compaction -- The Number of Consumers per Topic vs. CPU Use -- Summary -- Chapter 6: RAM Allocation in Kafka Clusters: Performance, Stability, and Optimization Strategies -- Adding RAM to a Kafka Cluster -- The Strategic Role of RAM Over CPU and Disks -- Cloud vs.
On-Prem RAM Expansion: Considerations and Constraints -- Adding RAM to the Cloud -- Adding RAM to On-Prem Kafka Clusters -- Enhancing Kafka's Performance: The Benefits of Increasing Broker RAM -- Performance Boost -- Disk I/O Reduction -- Throughput Enhancement -- Latency Reduction -- Understanding the Linux Page Cache -- Page Cache in Kafka: Accelerating Writes and Reads -- Balancing Performance and Reliability: Kafka's Page Cache Utilization -- Monitoring Page Cache Usage Using the Cachestat Tool -- Lack of RAM and its Effect on Disks -- Optimize Kafka Disks When the Cluster Lacks RAM -- Use SSDs Instead of HDDs -- Distribute Logs Across Disks -- Tune OS Disk Scheduling Algorithm -- Adjust Kafka's Disk Flush Policies -- Enable Log Compression -- Enable OS Page Cache -- Monitor Disk Usage and I/O -- A Lack of RAM Can Cause Disks to Reach IOPS Saturation -- Optimize Kafka in Terms of RAM Allocation -- Set vm.swappiness to the Minimum Possible Value -- Increase the File Descriptor Limits -- Increase the Limit of Memory-Mapped Files -- Give at Least 32GB RAM to Your Kafka Brokers -- Monitor Garbage Collection Times Closely -- Tuning JVM Options -- Using Appropriate Instance Types When Deploying on a Cloud Platform -- Balancing Topics and Partitions Across Brokers -- Dealing with Garbage Collection (GC) and Out-Of-Memory (OOM) -- Latency Spikes -- Resource Utilization -- System Stability -- Impact on ZooKeeper Heartbeat -- Measuring Kafka Memory Usage -- The Crucial Role of RAM: Lessons from a Non-Kafka Cluster -- Summary -- Chapter 7: Disk I/O Overload in Kafka: Diagnosing and Overcoming Challenges -- Disk Performance Metrics -- Detecting Whether Disks Cause Latency in Kafka Brokers, Consumers, or Producers -- How Kafka Reads and Writes to the Disks -- Writes -- Reads -- Disk Performance Detection -- Data Skew in the Scope of a Single Broker -- Data Skew in the Scope of a Kafka Cluster -- Consumer Lag from a Specific Broker -- Slow (Faulty) Disk -- Real Production
Issue: Detecting a Faulty Broker Using Disk Performance Metrics -- Discussion -- The Effect of Too Many disk.io Threads -- Discussion -- Looking at Disk Performance the Whole Time vs. During Peak Time Only -- Discussion -- The Effect of disk.io Threads on Broker, Producer, and Consumer Performance -- Request Queue Size -- Produce Latency -- Number of JVM Threads -- Number of Context Switches -- CPU User Time, System Time, and Normalized Load Average -- Discussion -- Summary -- Chapter 8: Disk Configuration: RAID 10 vs. JBOD -- RAID 10 and JBOD Terminology -- RAID 0 (aka Stripe Set) -- RAID 1 (aka Mirror Set) -- RAID 1+0 (aka RAID 10) -- Comparing RAID 10 and JBOD -- Disk Failure -- Data Skew -- Storage Use -- Pros and Cons of RAID 10 and JBOD -- Performance of Write Operations -- Storage Usage -- Disk Failure Tolerance -- Considering the Maintenance Burden of Disk Failure in On-Premises Clusters -- Disk Health Monitoring -- Frequency of Replacing Disks -- Kafka Availability During Disk Replacement -- Balancing the Data Between the Disks in the Broker -- JBOD -- RAID 10 -- Managing Disk Health in Kafka Clusters with JBOD Configuration -- Summary -- Chapter 9: A Deep Dive Into Producer Monitoring -- Producer Metrics -- Network I/O Rate Metric -- When Network I/O Rate Is High -- When Network I/O Rate Is Low -- Importance of the Network I/O Rate Metric -- Record Queue Time Metric -- When Record Queue Time Is High -- When Record Queue Time Is Low -- Mitigating a High Record Queue Time -- Importance of the Record Queue Time Metric -- Output Bytes Metric -- When Output Bytes Is High -- When Output Bytes Is Low -- Mitigating the Output Bytes Value -- Importance of the Output Bytes Metric -- Input Bytes Metric -- The Difference Between Output Bytes and Input Bytes -- Average Batch Size Metric -- When Average Batch Size Is High -- When Average Batch Size Is Low -- Mitigating the Average Batch Size Metric -- Importance of the Average Batch Size Metric -- Buffer Available Bytes
Metric -- When Buffer Available Bytes Is High -- When Buffer Available Bytes Is Low -- Mitigating the Buffer Available Bytes Metric -- Importance of the Buffer Available Bytes Metric -- Request Latency (Avg/Max) Metrics -- When Request Latency Is High -- When Request Latency Is Low -- Mitigating the Request Latency Metric -- Importance of the Request Latency Metric -- Understanding the Impact of Multiple Producers and Consumers on the Kafka Cluster -- Compression Rate: A Special Kind of Producer Metric -- Configuring Compression on the Producer and Broker Levels -- Compression Rate -- When Compression Rate Is High -- When Compression Rate Is Low -- Mitigating Compression Rate -- Importance of the Compression Rate Metric -- Summary -- Chapter 10: A Deep Dive Into Consumer Monitoring -- Consumer Metrics -- Consumer Lag Metrics -- When the Consumer Lag Metric Is High -- When the Consumer Lag Metric Is Low -- Mitigating the Consumer Lag Metric -- Importance of the Consumer Lag Metric -- Fetch Request Rate Metric -- When the Fetch Request Rate Is High -- When the Fetch Request Rate Is Low -- Mitigating the Fetch Request Rate -- Importance of the Fetch Request Rate -- Fetch Request Size (Avg/Max) Metrics -- When the Fetch Request Size Is High -- When the Fetch Request Size Is Low -- Mitigating the Fetch Request Size -- Importance of the Fetch Request Size Metrics -- Consumer I/O Wait Ratio Metric -- When the Consumer I/O Wait Ratio Is High -- When the Consumer I/O Wait Ratio Is Low -- Mitigating the Consumer I/O Wait Ratio -- Importance of the Consumer I/O Wait Ratio -- Records per Request Avg Metric -- When the Records per Request Metric Is High -- When the Records per Request Metric Is Low -- Mitigating the Records per Request Metric -- Importance of the Records per Request Metric -- Fetch Latency Avg/Max Metrics -- When the Fetch Latency Metrics Are High -- When the Fetch Latency Metrics Are Low -- Mitigating the Fetch Latency Metrics -- Importance of the Fetch Latency
Metrics -- Consumer Request Rate Metric -- When the Consumer Request Rate Metric Is High.

This book provides Kafka administrators, site reliability engineers, and DataOps and DevOps practitioners with a list of real production issues that can occur in Kafka clusters and how to solve them. The production issues covered are assembled into a comprehensive troubleshooting guide for engineers who are responsible for the stability and performance of Kafka clusters in production, whether those clusters are deployed in the cloud or on-premises. This book teaches you how to detect and troubleshoot the issues, and eventually how to prevent them. Kafka stability is hard to achieve, especially in high-throughput environments, and the purpose of this book is not only to make troubleshooting easier, but also to prevent production issues from occurring in the first place. The guidance in this book is drawn from the author's years of experience in helping clients and internal customers diagnose and resolve knotty production problems and stabilize their Kafka environments. The book is organized into recipe-style troubleshooting checklists that field engineers can easily follow when under pressure to fix an unstable cluster. This is the book you will want by your side when the stakes are high and your job is on the line.
You will:
Monitor and resolve production issues in your Kafka clusters
Provision Kafka clusters with the lowest costs and still handle the required loads
Perform root cause analyses of issues affecting your Kafka clusters
Know the ways in which your Kafka cluster can affect its consumers and producers
Prevent or minimize data loss and delays in data streaming
Forestall production issues through an understanding of common failure points
Create checklists for troubleshooting your Kafka clusters when problems occur

Subjects: Big data; Cloud computing.
Classification: 005.713
Author: Eldor, Elad