Feature Based Change Log

Bug OCPBUGS-69: No event log was emitted when egressIP exceeds capacity limit for cloud providers with SDN plugin

View the Description View the linked PRs

There is capacity limit on egressIP for different cloud provider, for example, GCP, the limit is 10.

If the number of egressIP added to hostsubnet exceeds the capability limit, it is expected some logging message is emitted to event log, that can be seen through "oc get event"

On a GCP with SDN plugin, configure egressCIDRs on one worker node, configured 12 netnamespaces, each has 1 egressIP configured, the total number of egressIP for the hostsubnet has exceeded its capacity limit of 10. No event log was seen to indicate that the number of egressIP for the hostsubnet has exceeded the limit.

$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-08-02-014045 True False 160m Cluster version is 4.11.0-0.nightly-2022-08-02-014045

See attachment for more details.

https://github.com/openshift/sdn/pull/470

Bug OCPBUGS-22077: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/sdn/pull/585

Bug OCPBUGS-4133: Load Balance service with externalTrafficPolicy="Cluster" for Windows workloads intermittently unavailable in GCP and Azure

View the Description View the linked PRs

Description of problem:

When creating services in a OVN-HybridOverlay cluster with Windows workers, we are experiencing intermittent reachability issues for the external-ip when the number of pods from the expose deployment is bigger than 1:

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get svc -n winc-38186 
NAME            TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)        AGE
win-webserver   LoadBalancer   172.30.38.192   34.136.170.199   80:30246/TCP   41m

cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get deploy -n winc-38186 
NAME            READY   UP-TO-DATE   AVAILABLE   AGE
win-webserver   6/6     6            6           42m

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get pods -n winc-38186 
NAME                             READY   STATUS    RESTARTS   AGE
win-webserver-597fb4c9cc-8ccwg   1/1     Running   0          6s
win-webserver-597fb4c9cc-f54x5   1/1     Running   0          6s
win-webserver-597fb4c9cc-jppxb   1/1     Running   0          97s
win-webserver-597fb4c9cc-twn9b   1/1     Running   0          6s
win-webserver-597fb4c9cc-x5rfr   1/1     Running   0          6s
win-webserver-597fb4c9cc-z8sfv   1/1     Running   0          6s

[cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
curl: (7) Failed to connect to 34.136.170.199 port 80: Connection timed out
[cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
curl: (7) Failed to connect to 34.136.170.199 port 80: Connection timed out

When having a look at the Load Balancer service, we can see that the externalTrafficPolicy is of type "Cluster":

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get svc -n winc-38186 win-webserver -o yaml
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2022-11-25T13:29:00Z"
  finalizers:
  - service.kubernetes.io/load-balancer-cleanup
  labels:
    app: win-webserver
  name: win-webserver
  namespace: winc-38186
  resourceVersion: "169364"
  uid: 4a229123-ee88-47b6-99ce-814522803ad8
spec:
  allocateLoadBalancerNodePorts: true
  clusterIP: 172.30.38.192
  clusterIPs:
  - 172.30.38.192
  externalTrafficPolicy: Cluster
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - nodePort: 30246
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: win-webserver
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - ip: 34.136.170.199


Recreating the Service setting externalTrafficPolicy to Local seems to solve the issue:  $ oc describe svc win-webserver -n winc-38186
Name:                     win-webserver
Namespace:                winc-38186
Labels:                   app=win-webserver
Annotations:              <none>
Selector:                 app=win-webserver
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       172.30.38.192
IPs:                      172.30.38.192
LoadBalancer Ingress:     34.136.170.199
Port:                     <unset>  80/TCP
TargetPort:               80/TCP
NodePort:                 <unset>  30246/TCP
Endpoints:                10.132.0.18:80,10.132.0.19:80,10.132.0.20:80 + 3 more...
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type    Reason                 Age                 From                Message
  ----    ------                 ----                ----                -------
  Normal  ExternalTrafficPolicy  66m                 service-controller  Cluster -> Local
  Normal  EnsuringLoadBalancer   63m (x3 over 113m)  service-controller  Ensuring load balancer
  Normal  ExternalTrafficPolicy  63m                 service-controller  Local -> Cluster
  Normal  EnsuredLoadBalancer    62m (x3 over 113m)  service-controller  Ensured load balancer 

$ oc get svc -n winc-test
NAME              TYPE           CLUSTER-IP      EXTERNAL-IP    PORT(S)          AGE
linux-webserver   LoadBalancer   172.30.175.95   34.136.11.87   8080:30715/TCP   152m
win-check         LoadBalancer   172.30.50.151   35.194.12.34   80:31725/TCP     4m33s
win-webserver     LoadBalancer   172.30.15.95    35.226.129.1   80:30409/TCP     152m
[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34

While the other service which has externalTrafficPolicy set to "Cluster" is still failing:

[cloud-user@preserve-jfrancoa tmp]$ curl 35.226.129.1
curl: (7) Failed to connect to 35.226.129.1 port 80: Connection timed out

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-11-24-203151   True        False         7h2m    Cluster version is 4.12.0-0.nightly-2022-11-24-203151


$ oc get network cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2022-11-25T06:56:50Z"
  generation: 2
  name: cluster
  resourceVersion: "2952"
  uid: e9ad729c-36a4-4e71-9a24-740352b11234
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  externalIP:
    policy: {}
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
status:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  clusterNetworkMTU: 1360
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16

How reproducible:

Always, sometimes it takes more curl calls to the External IP, but it always ends up timeouting

Steps to Reproduce:

1. Deploy a Windows cluster with OVN-Hybrid overlay on GCP, the following Jenkins job can be used for it: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/158926/
2. Create a deployment and a service, for example:
kind: Service
metadata:
  labels:
    app: win-check
  name: win-check
  namespace: winc-test
spec:
  #externalTrafficPolicy: Local
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: win-check
  type: LoadBalancer
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: win-check
  name: win-check
  namespace: winc-test
spec:
  replicas: 6
  selector:
    matchLabels:
      app: win-check
  template:
    metadata:
      labels:
        app: win-check
      name: win-check
    spec:
      containers:
      - command:
        - pwsh.exe
        - -command
        - $listener = New-Object System.Net.HttpListener; $listener.Prefixes.Add('http://*:80/');
          $listener.Start();Write-Host('Listening at http://*:80/'); while ($listener.IsListening)
          { $context = $listener.GetContext(); $response = $context.Response; $content='<html><body><H1>Windows
          Container Web Server</H1></body></html>'; $buffer = [System.Text.Encoding]::UTF8.GetBytes($content);
          $response.ContentLength64 = $buffer.Length; $response.OutputStream.Write($buffer,
          0, $buffer.Length); $response.Close(); };
        image: mcr.microsoft.com/powershell:lts-nanoserver-ltsc2022
        name: win-check
        securityContext:
          runAsNonRoot: false
          windowsOptions:
            runAsUserName: ContainerAdministrator
      nodeSelector:
        kubernetes.io/os: windows
      tolerations:
      - key: os
        value: Windows
  3.Get the external IP for the service: 
$ oc get svc -n winc-test                                                   
NAME              TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)          AGE                                            
linux-webserver   LoadBalancer   172.30.175.95   34.136.11.87     8080:30715/TCP   94m                                            
win-check         LoadBalancer   172.30.82.251   35.239.175.209   80:30530/TCP     29s                                            
win-webserver     LoadBalancer   172.30.15.95    35.226.129.1     80:30409/TCP     94m

  4. Try to curl the external-ip:
$ curl 35.239.175.209
curl: (7) Failed to connect to 35.239.175.209 port 80: Connection timed out

Actual results:

The Load Balancer IP is not reachable, thus impacting in the service availability

Expected results:

The Load Balancer IP is available at all times

Additional info:

https://github.com/openshift/sdn/pull/498

Bug OCPBUGS-5842: executes /host/usr/bin/oc

View the Description View the linked PRs

https://github.com/openshift/cluster-network-operator/blob/830daae9472c1e3f525c0af66bc7ea4054de9989/bindata/network/openshift-sdn/sdn.yaml#L308
is executing the host's `oc` but in a container userspace. This breaks when we try to update RHCOS to RHEL9, but leave the SDN pods as rhel8.

This code looks likely better to directly write in Go instead of bash.

https://github.com/openshift/sdn/pull/495

Bug OCPBUGS-12233: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/sdn/pull/535

Bug OCPBUGS-12435: EgressNetworkPolicy DNS resolution does not fall back to TCP for truncated responses

View the Description View the linked PRs

Description of problem:

If the user specifies a DNS name in an egressnetworkpolicy for which the upstream server returns a truncated DNS response, openshift-sdn does not fall back to TCP as expected but just take this as a failure.

Version-Release number of selected component (if applicable):

4.11 (originally reproduced on 4.9)

How reproducible:

Always

Steps to Reproduce:

1. Setup an EgressNetworkPolicy that points to a domain where a truncated response is returned while querying via UDP.
2.
3.

Actual results:

Error, DNS resolution not completed.

Expected results:

Request retried via TCP and succeeded.

Additional info:

In comments.

https://github.com/openshift/sdn/pull/532

Bug OCPBUGS-256: intra namespace allow network policy doesn't work after applying ingress&egress deny all network policy

View the Description View the linked PRs

Description of problem:

intra namespace allow network policy doesn't work after applying ingress&egress deny all network policy

Version-Release number of selected component (if applicable):

OpenShift 4.10.12

How reproducible:

Always

Steps to Reproduce:
1. Define deny all network policy for egress an ingress in a namespace:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

2. Define the following network policy to allow the traffic between the pods in the namespace:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-intra-namespace-001
spec:
  egress:
  - to:
    - podSelector: {}
  ingress:
  - from:
    - podSelector: {}
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

3. Test the connectivity between two pods from the namespace.

Actual results:

The connectivity is not allowed

Expected results:

The connectivity should be allowed between pods from the same namespace.

Additional info:

After performing a test and analyzing SDN flows for the namespace:

sh-4.4# ovs-ofctl dump-flows -O OpenFlow13 br0 | grep --color 0x964376 
 cookie=0x0, duration=99375.342s, table=20, n_packets=14, n_bytes=588, priority=100,arp,in_port=21,arp_spa=10.128.2.20,arp_sha=00:00:0a:80:02:14/00:00:ff:ff:ff:ff actions=load:0x964376->NXM_NX_REG0[],goto_table:30
 cookie=0x0, duration=1681.845s, table=20, n_packets=11, n_bytes=462, priority=100,arp,in_port=24,arp_spa=10.128.2.23,arp_sha=00:00:0a:80:02:17/00:00:ff:ff:ff:ff actions=load:0x964376->NXM_NX_REG0[],goto_table:30
 cookie=0x0, duration=99375.342s, table=20, n_packets=135610, n_bytes=759239814, priority=100,ip,in_port=21,nw_src=10.128.2.20 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
 cookie=0x0, duration=1681.845s, table=20, n_packets=2006, n_bytes=12684967, priority=100,ip,in_port=24,nw_src=10.128.2.23 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
 cookie=0x0, duration=99375.342s, table=25, n_packets=0, n_bytes=0, priority=100,ip,nw_src=10.128.2.20 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
 cookie=0x0, duration=1681.845s, table=25, n_packets=0, n_bytes=0, priority=100,ip,nw_src=10.128.2.23 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
 cookie=0x0, duration=975.129s, table=27, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=goto_table:30
 cookie=0x0, duration=99375.342s, table=70, n_packets=145260, n_bytes=11722173, priority=100,ip,nw_dst=10.128.2.20 actions=load:0x964376->NXM_NX_REG1[],load:0x15->NXM_NX_REG2[],goto_table:80
 cookie=0x0, duration=1681.845s, table=70, n_packets=2336, n_bytes=191079, priority=100,ip,nw_dst=10.128.2.23 actions=load:0x964376->NXM_NX_REG1[],load:0x18->NXM_NX_REG2[],goto_table:80
 cookie=0x0, duration=975.129s, table=80, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=output:NXM_NX_REG2[]

We see that the following rule doesn't match because `reg1` hasn't been defined:

 cookie=0x0, duration=975.129s, table=27, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=goto_table:30

https://github.com/openshift/sdn/pull/459

Bug OCPBUGS-6714: Traffic from egress IPs was interrupted after Cluster patch to Openshift 4.10.46

View the Description View the linked PRs

Description of problem:

Traffic from egress IPs was interrupted after Cluster patch to Openshift 4.10.46

a customer cluster was patched. It is an Openshift 4.10.46 cluster with SDN.

More description about issue is available in private comment below since it contains customer data.

https://github.com/openshift/sdn/pull/503

Bug OCPBUGS-25779: Update 4.16 kube-proxy-container image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/sdn/pull/600

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/sdn/pull/600

Bug OCPBUGS-95: NMstate removes egressip in OpenShift cluster with SDN plugin

View the Description View the linked PRs

In an OpenShift cluster with OpenShiftSDN network plugin with egressIP and NMstate operator configured, there are some conditions when the egressIP is deconfigured from the network interface.

The bug is 100% reproducible.

Steps for reproducing the issue are:

1. Install a cluster with OpenShiftSDN network plugin.

2. Configure egressip for a project.

3. Install NMstate operator.

4. Create a NodeNetworkConfigurationPolicy.

5. Identify on which node the egressIP is present.

6. Restart the nmstate-handler pod running on the identified node.

7. Verify that the egressIP is no more present.

Restarting the sdn pod related to the identified node will reconfigure the egressIP in the node.

This issue has a high impact since any changes triggered for the NMstate operator will prevent application traffic. For example, in the customer environment, the issue is triggered any time a new node is added to the cluster.

The expectation is that NMstate operator should not interfere with SDN configuration.

Bug OCPBUGS-1533: sdn rebase to 1.25

View the Description View the linked PRs

We need to rebase openshift-sdn to kube 1.25's kube-proxy.

In particular, we need this to get https://github.com/kubernetes/kubernetes/pull/110334 into master because we will probably get asked to backport it.

https://github.com/openshift/sdn/pull/458

Bug OCPBUGS-25015: Update 4.16 kube-proxy-container image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/sdn/pull/596

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/sdn/pull/596

Bug OCPBUGS-25036: sdn: Decouple from cli image

View the Description View the linked PRs

Description of problem:

The sdn image inherits from the cli image to get the oc binary. Change this to install the openshift-clients rpm instead.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.

2.

3.

Actual results:

Expected results:

Additional info:

Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

Affected Platforms:

Is it an

internal CI failure
customer issue / SD
internal RedHat testing failure

If it is an internal RedHat testing failure:

Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (specially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

If it is a CI failure:

Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
If it's a connectivity issue,
What is the srcNode, srcIP and srcNamespace and srcPodName?
What is the dstNode, dstIP and dstNamespace and dstPodName?
What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

If it is a customer / SD issue:

Provide enough information in the bug description that Engineering doesn't need to read the entire case history.
Don't presume that Engineering has access to Salesforce.
Please provide must-gather and sos-report with an exact link to the comment in the support case with the attachment. The format should be: https://access.redhat.com/support/cases/#/case/<case number>/discussion?attachmentId=<attachment id>
Describe what each attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
- If the issue is in a customer namespace then provide a namespace inspect.
- If it is a connectivity issue:
  - What is the srcNode, srcNamespace, srcPodName and srcPodIP?
  - What is the dstNode, dstNamespace, dstPodName and dstPodIP?
  - What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
  - Please provide the UTC timestamp networking outage window from must-gather
  - Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
- If it is not a connectivity issue:
  - Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem has happened if any.

For OCPBUGS in which the issue has been identified, label with "sbr-triaged"
For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, labels with "sbr-untriaged"
Note: bugs that do not meet these minimum standards will be closed with label "SDN-Jira-template"

https://github.com/openshift/sdn/pull/593

Bug OCPBUGS-25740: Update 4.16 ose-node-container image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/sdn/pull/599

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/sdn/pull/599

Bug OCPBUGS-19143: Update 4.15 kube-rbac-proxy image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/kube-rbac-proxy/pull/72

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/sdn/pull/575

Bug OCPBUGS-26084: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/sdn/pull/622

Bug OCPBUGS-34279: ART requests updates to 4.17 image kube-proxy-container

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/sdn/pull/623

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bug OCPBUGS-8007: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/sdn/pull/515

Bug OCPBUGS-9985: TCP DNS Local Preference is not working for Openshift SDN

View the Description View the linked PRs

Description of problem:

DNS Local endpoint preference is not working for TCP DNS requests for Openshift SDN.

Reference code: https://github.com/openshift/sdn/blob/b58a257b896d774e0a092612be250fb9414af5ca/vendor/k8s.io/kubernetes/pkg/proxy/iptables/proxier.go#L999-L1012

This is where the DNS request is short-circuited to the local DNS endpoint if it exists. This is important because DNS local preference protects against another outstanding bug, in which daemonset pods go stale for a few second upon node shutdown (see https://issues.redhat.com/browse/OCPNODE-549 for fix for graceful node shutdown). This appears to be contributing to DNS issues in our internal CI clusters. https://lookerstudio.google.com/reporting/3a9d4e62-620a-47b9-a724-a5ebefc06658/page/MQwFD?s=kPTlddLa2AQ shows large amounts of "dns_tcp_lookup" failures, which I attribute to this bug.

UDP DNS local preference is working fine in Openshift SDN. Both UDP and TCP local preference work fine in OVN. It's just TCP DNS Local preference that is not working Openshift SDN.

Version-Release number of selected component (if applicable):

4.13, 4.12, 4.11

How reproducible:

100%

Steps to Reproduce:

1. oc debug -n openshift-dns
2. dig +short +tcp +vc +noall +answer CH TXT hostname.bind
# Retry multiple times, and you should always get the same local DNS pod.

Actual results:

[gspence@gspence origin]$ oc debug -n openshift-dns
Starting pod/image-debug ...
Pod IP: 10.128.2.10
If you don't see a command prompt, try pressing enter.
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-gzlhm"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-dnbsp"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-gzlhm"

Expected results:

[gspence@gspence origin]$ oc debug -n openshift-dns
Starting pod/image-debug ...
Pod IP: 10.128.2.10
If you don't see a command prompt, try pressing enter.
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8"

Additional info:

https://issues.redhat.com/browse/OCPBUGS-488 is the previous bug I opened for UDP DNS local preference not working.

iptables-save from a 4.13 vanilla cluster bot AWS,SDN: https://drive.google.com/file/d/1jY8_f64nDWi5SYT45lFMthE0vhioYIfe/view?usp=sharing

https://github.com/openshift/sdn/pull/518

Bug OCPBUGS-13975: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/sdn/pull/546

Bug OCPBUGS-23666: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/sdn/pull/604

Bug OCPBUGS-3176: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/sdn/pull/549

Bug OCPBUGS-43764: Local endpoint for the DNS service not working in OpenShift 4.17 with 3rd party CNI

View the Description View the linked PRs

Description of problem:

IBM ROKS uses Calico as their CNI. In previous versions of OpenShift, OpenShiftSDN would create IPTable rules that would force local endpoint for DNS Service.

Starting in OCP 4.17 with the removal of SDN, IBM ROKS is not using OVN-K and therefor local endpoint for dns service is not working as expected.

IBM ROKS is asking that the code block be restored to restore the functionality previously seen in OCP 4.16

https://github.com/openshift/sdn/blob/release-4.16/vendor/k8s.io/kubernetes/pkg/proxy/iptables/proxier.go#L979-L992

Without this functionality IBM ROKS is not able to GA OCP 4.17

https://github.com/openshift/sdn/pull/638

Bug OCPBUGS-13717: The ovnver and ovsver args should be used even to infer to short versions of the RPMs to install in the sdn container images

View the Description View the linked PRs

The ovnver and ovsver args should be used even to infer to short versions of the RPMs to install in the sdn container images

https://github.com/openshift/sdn/pull/534

Bug OCPBUGS-18785: sdn-controller should never try to a lease as "localhost.localdomain"

View the Description View the linked PRs

Description of problem:

During a highly escalated scenario, we have found the following scenario:
- Due to an unrelated problem, 2 control plane nodes had "localhost.localdomain" hostname when their respective sdn-controller pods started (this problem would be out of the scope of this bug report).
- As both sdn-controller pods had (and retained) the "localhost.localdomain" hostname, this caused both of them to use "localhost.localdomain" while trying to acquire and renew the controller lease in openshift-network-controller configmap.
- This ultimately caused both sdn-controller pods to mistakenly believe that they were the active sdn-controller, so both of them were active at the same time.

Such a situation might have a number of undesired (and unknown) side effects. In our case, the result was that two nodes were allocated the same hostsubnet, disrupting pod communication between the 2 nodes and with the other nodes.

What we expect from this bug report: That the sdn-controller never tries to acquire a lease as "localhost.localdomain" during a failure scenario. The ideal solution would be to acquire the lease in a way that avoids collisions (more on this on comments), but at the very least, sdn-controller should prefer crash-looping rather than starting with a lease that can collide and wreak havoc.

Version-Release number of selected component (if applicable):

Found on 4.11, but it should be reproducible in 4.13 as well.

How reproducible:

Under some error scenarios where 2 control plane nodes temporarily have "localhost.localdomain" hostname by mistake.

Steps to Reproduce:

1. Start sdn-controller pods
2.
3.

Actual results:

2 sdn-controller pods acquire the lease with "localhost.localdomain" holderIdentity and become active at the same time.

Expected results:

No sdn-controller pod to acquire the lease with "localhost.localdomain" holderIdentity. Either use unique identities even when there is failure scenario or just crash-loop.

Additional info:

Just FYI, the trigger that caused the wrong domain was investigated at this other bug: https://issues.redhat.com/browse/OCPBUGS-11997

However, this situation may happen under other possible failure scenarios, so it is worth preventing it somehow.

https://github.com/openshift/sdn/pull/578

Bug OCPBUGS-30431: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/sdn/pull/620

Bug OCPBUGS-12644: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/sdn/pull/538

Bug OCPBUGS-16790: The file permissions of /var/run/multus/cni/net.d/*.conf on nodes should be updated to 600 to conform with CIS benchmarks

View the Description View the linked PRs

Description of problem:

Observation from CISv1.4 pdf:
1.1.9 Ensure that the Container Network Interface file permissions are set to 600 or more restrictive
“Container Network Interface provides various networking options for overlay networking.
You should consult their documentation and restrict their respective file permissions to maintain the integrity of those files. Those files should be writable by only the administrators on the system.”
 
To conform with CIS benchmarksChange, the /var/run/multus/cni/net.d/*.conf files on nodes should be updated to 600.

$ for i in $(oc get pods -n openshift-multus -l app=multus -oname); do oc exec -n openshift-multus $i -- /bin/bash -c "stat -c \"%a %n\" /host/var/run/multus/cni/net.d/*.conf"; done
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-07-20-215234

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

The file permissions of /var/run/multus/cni/net.d/*.conf on nodes is 644.

Expected results:

The file permissions of /var/run/multus/cni/net.d/*.conf on nodes should be updated to 600

Additional info:

https://github.com/openshift/sdn/pull/570

Bug OCPBUGS-17316: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/sdn/pull/571

Bug OCPBUGS-19103: Update 4.15 ose-sdn image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/sdn/pull/574

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/sdn/pull/574

Bug OCPBUGS-1370: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/sdn/pull/525

Bug OCPBUGS-16788: The file permissions of /var/lib/cni/networks/openshift-sdn in all sdn pods should be updated to 600 to conform with CIS benchmarks

View the Description View the linked PRs

Description of problem:

Observation from CISv1.4 pdf:
1.1.9 Ensure that the Container Network Interface file permissions are set to 600 or more restrictive

"Container Network Interface provides various networking options for overlay networking.
You should consult their documentation and restrict their respective file permissions to maintain the integrity of those files. Those files should be writable by only the administrators on the system."
 
To conform with CIS benchmarksChange, the /var/lib/cni/networks/openshift-sdn files in all sdn pods should be updated to 600.
$ for i in $(oc get pods -n openshift-sdn -l app=sdn -oname); do oc exec -n openshift-sdn $i -- find /var/lib/cni/networks/openshift-sdn -type f -exec stat -c %a {} \;; done
Defaulted container "sdn" out of: sdn, kube-rbac-proxy
644
644
644
644
644
644
644
644
644
644
644
644
644
Defaulted container "sdn" out of: sdn, kube-rbac-proxy
644
644
644
644
644
644
644
644
644
644
644
644
644
Defaulted container "sdn" out of: sdn, kube-rbac-proxy
644
644
644
644
644
644
644
644
644
644
644
644
Defaulted container "sdn" out of: sdn, kube-rbac-proxy
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
Defaulted container "sdn" out of: sdn, kube-rbac-proxy
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
Defaulted container "sdn" out of: sdn, kube-rbac-proxy
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-07-20-215234

How reproducible:

Always

Steps to Reproduce:

1.
2.
3.

Actual results:

The file permissions for /var/lib/cni/networks/openshift-sdn files in all sdn pods is 644

Expected results:

The file permissions for /var/lib/cni/networks/openshift-sdn files in all sdn pods should be updated to 600

Additional info:

https://github.com/openshift/sdn/pull/584

4.19.0-0.nightly-ppc64le-2024-11-23-164910

Changes from 4.18.0-rc.0

Complete Features

Template:

Epic Goal

Why is this important?

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

Acceptance Criteria

Dependencies (internal and external)

Previous Work (Optional):

Open questions::

Done Checklist

Template:

Epic Goal

Why is this important?

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

Acceptance Criteria

Dependencies (internal and external)

Previous Work (Optional):

Open questions::

Done Checklist

Feature Overview (aka. Goal Summary)

Goals (aka. expected user outcomes)

Requirements (aka. Acceptance Criteria):

Use Cases (Optional):

Questions to Answer (Optional):

Out of Scope

Background

Customer Considerations

Documentation Considerations

Interoperability Considerations

Epic Goal

Why is this important?

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

Acceptance Criteria

Dependencies (internal and external)

Complete Epics

Other Complete