
4.19.0-0.nightly-ppc64le-2024-11-24-150812

Jump to: Complete Features | Complete Epics | Other Complete

Changes from 4.18.0-rc.0

Note: this page shows the Feature-Based Change Log for a release

Complete Features

These features were completed when this image was assembled

Template:

Networking Definition of Planned

Epic Template descriptions and documentation

Epic Goal

Why is this important?

Planning Done Checklist

The following items must be completed on the Epic before moving it from Planning to ToDo status.

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox, which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the checkbox and record the outcome of the discussion in the epic description.
    2. The guidance is that unless it is very clear that your epic has no managed-services impact, default to using the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

Previous Work (Optional):

1. …

Open questions:

1. …

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Template:

Networking Definition of Planned

Epic Template descriptions and documentation

Epic Goal

Make sure we deliver a 1.30 kube-proxy standalone image.
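
A quick way to verify the deliverable, assuming the standalone image is published in the release payload under the tag kube-proxy (a minimal sketch, not part of the epic text; RELEASE_PULLSPEC is a placeholder for the nightly payload):

# Resolve the kube-proxy image pullspec from the release payload; inspect it to confirm it is built from Kubernetes 1.30.
oc adm release info "$RELEASE_PULLSPEC" --image-for=kube-proxy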

Why is this important?

Planning Done Checklist

The following items must be completed on the Epic before moving it from Planning to ToDo status.

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox, which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the checkbox and record the outcome of the discussion in the epic description.
    2. The guidance is that unless it is very clear that your epic has no managed-services impact, default to using the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

Previous Work (Optional):

1. …

Open questions:

1. …

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)

Migrate every occurrence of iptables in OpenShift to use nftables instead.

Goals (aka. expected user outcomes)

Implement a full migration from iptables to nftables within a series of "normal" upgrades of OpenShift with the goal of not causing any more network disruption than would normally be required for an OpenShift upgrade. (Different components may migrate from iptables to nftables in different releases; no coordination is needed between unrelated components.)
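
For components that manage their own rules, the iptables-nft tooling can translate existing invocations mechanically; a minimal sketch (the rule shown is illustrative, not taken from any OpenShift component):

# Translate an existing iptables invocation into native nft syntax (iptables-translate ships with iptables-nft).
iptables-translate -A FORWARD -p tcp --dport 8080 -j ACCEPT
# Prints something like: nft add rule ip filter FORWARD tcp dport 8080 counter accept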

Requirements (aka. Acceptance Criteria):

  • Discover what components are using iptables (directly or indirectly, e.g. via ipfailover) and reduce the “unknown unknowns”.
  • Port components away from iptables.

Use Cases (Optional):

Questions to Answer (Optional):

  • Do we need a better “warning: you are using iptables” warning for customers? (e.g., per-container rather than per-node; the per-node warning always fires because OCP itself is using iptables.) This would also give better visibility to other components that aren't sure whether they need to take action and migrate to nftables.

Out of Scope

  • Non-OVN primary CNI plug-in solutions

Background

Customer Considerations

  • What happens to clusters that don't migrate all iptables use to nftables?
    • In RHEL 9.x, it will generate a single log message during node startup on every OpenShift node, and there are Insights rules that will trigger on all OpenShift nodes.
    • In RHEL 10, iptables will no longer work at all: neither the command-line tools nor the kernel modules will be present.

Documentation Considerations

Interoperability Considerations

Template:

 

Networking Definition of Planned

Epic Template descriptions and documentation 

 

Epic Goal

  • OCP needs to detect when customer workloads are making use of iptables, and present this information to the customer (e.g. via alerts, metrics, insights, etc)
  • The RHEL 9 kernel logs a warning if iptables is used at any point anywhere in the system, but this is not helpful because OCP itself still uses iptables, so the warning is always logged.
  • We need to avoid false positives due to OCP's own use of iptables in pod namespaces (e.g. the rules to block access to the MCS). Porting those rules to nftables sooner rather than later is one solution.
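
For illustration, a native nftables version of such a block rule might look like the sketch below; the table and chain names and the use of the MCS port (22623) are assumptions for the example, not the actual implementation:

# Reject traffic from the pod network namespace to the Machine Config Server port.
nft add table inet openshift-block-mcs
nft 'add chain inet openshift-block-mcs output { type filter hook output priority 0; policy accept; }'
nft add rule inet openshift-block-mcs output tcp dport 22623 reject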

Why is this important?

  • iptables will not exist in RHEL 10, so if customers are depending on it, they need to be warned.
  • Conversely, we are getting questions from customers who are not using iptables in their own workload containers and are confused by the kernel warning. Clearer messaging should help reduce that confusion.

Planning Done Checklist

The following items must be completed on the Epic before moving it from Planning to ToDo status.

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)

Additional information on each of the above items can be found here: Networking Definition of Planned

 

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

 

This should be the last openshift-sdn Kubernetes rebase, but we need to work with the Windows team to find a way for them to get the latest kube-proxy without depending on this rebase, since openshift-sdn is deprecated.

Complete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled

Other Complete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled

Please review the following PR: https://github.com/openshift/sdn/pull/574

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/sdn/pull/599

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The sdn image inherits from the cli image to get the oc binary. Change this to install the openshift-clients rpm instead.
 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.

2.

3.

 

Actual results:

 

Expected results:

 

Additional info:

Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

Affected Platforms:

Is it an

  1. internal CI failure 
  2. customer issue / SD
  3. internal Red Hat testing failure

 

If it is an internal Red Hat testing failure:

  • Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (especially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

 

If it is a CI failure:

 

  • Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
  • Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
  • Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
  • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
  • If it's a connectivity issue,
  • What is the srcNode, srcIP and srcNamespace and srcPodName?
  • What is the dstNode, dstIP and dstNamespace and dstPodName?
  • What is the traffic path? (examples: pod2pod, pod2external, pod2svc, pod2Node, etc.)

 

If it is a customer / SD issue:

 

  • Provide enough information in the bug description that Engineering doesn't need to read the entire case history.
  • Don't presume that Engineering has access to Salesforce.
  • Please provide must-gather and sos-report with an exact link to the comment in the support case with the attachment.  The format should be: https://access.redhat.com/support/cases/#/case/<case number>/discussion?attachmentId=<attachment id>
  • Describe what each attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).  
  • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
    • If the issue is in a customer namespace then provide a namespace inspect.
    • If it is a connectivity issue:
      • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
      • What is the dstNode, dstNamespace, dstPodName and  dstPodIP?
      • What is the traffic path? (examples: pod2pod, pod2external, pod2svc, pod2Node, etc.)
      • Please provide the UTC timestamp networking outage window from must-gather
      • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
    • If it is not a connectivity issue:
      • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when the problem happened, if any.
  • For OCPBUGS in which the issue has been identified, label with "sbr-triaged"
  • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with "sbr-untriaged"
  • Note: bugs that do not meet these minimum standards will be closed with label "SDN-Jira-template"

Description of problem:

  An intra-namespace allow network policy doesn't work after applying an ingress & egress deny-all network policy.

Version-Release number of selected component (if applicable):

  OpenShift 4.10.12

How reproducible:

Always

Steps to Reproduce:
  1. Define a deny-all network policy for egress and ingress in a namespace:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

2. Define the following network policy to allow the traffic between the pods in the namespace:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-intra-namespace-001
spec:
  egress:
  - to:
    - podSelector: {}
  ingress:
  - from:
    - podSelector: {}
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress 

3. Test the connectivity between two pods from the namespace.

Actual results:

   The connectivity is not allowed

Expected results:

  The connectivity should be allowed between pods from the same namespace.

Additional info:

  After performing a test and analyzing SDN flows for the namespace: 

sh-4.4# ovs-ofctl dump-flows -O OpenFlow13 br0 | grep --color 0x964376 
 cookie=0x0, duration=99375.342s, table=20, n_packets=14, n_bytes=588, priority=100,arp,in_port=21,arp_spa=10.128.2.20,arp_sha=00:00:0a:80:02:14/00:00:ff:ff:ff:ff actions=load:0x964376->NXM_NX_REG0[],goto_table:30
 cookie=0x0, duration=1681.845s, table=20, n_packets=11, n_bytes=462, priority=100,arp,in_port=24,arp_spa=10.128.2.23,arp_sha=00:00:0a:80:02:17/00:00:ff:ff:ff:ff actions=load:0x964376->NXM_NX_REG0[],goto_table:30
 cookie=0x0, duration=99375.342s, table=20, n_packets=135610, n_bytes=759239814, priority=100,ip,in_port=21,nw_src=10.128.2.20 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
 cookie=0x0, duration=1681.845s, table=20, n_packets=2006, n_bytes=12684967, priority=100,ip,in_port=24,nw_src=10.128.2.23 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
 cookie=0x0, duration=99375.342s, table=25, n_packets=0, n_bytes=0, priority=100,ip,nw_src=10.128.2.20 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
 cookie=0x0, duration=1681.845s, table=25, n_packets=0, n_bytes=0, priority=100,ip,nw_src=10.128.2.23 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
 cookie=0x0, duration=975.129s, table=27, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=goto_table:30
 cookie=0x0, duration=99375.342s, table=70, n_packets=145260, n_bytes=11722173, priority=100,ip,nw_dst=10.128.2.20 actions=load:0x964376->NXM_NX_REG1[],load:0x15->NXM_NX_REG2[],goto_table:80
 cookie=0x0, duration=1681.845s, table=70, n_packets=2336, n_bytes=191079, priority=100,ip,nw_dst=10.128.2.23 actions=load:0x964376->NXM_NX_REG1[],load:0x18->NXM_NX_REG2[],goto_table:80
 cookie=0x0, duration=975.129s, table=80, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=output:NXM_NX_REG2[]

We see that the following rule doesn't match because `reg1` hasn't been defined:

 cookie=0x0, duration=975.129s, table=27, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=goto_table:30 
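
One way to confirm where such traffic stops is to trace a synthetic packet through br0 (a debugging sketch run from the same sdn/ovs container; the addresses reuse the ones above and the TCP port is arbitrary):

# Show which tables and flows the packet traverses and where it is finally dropped.
ovs-appctl ofproto/trace br0 'in_port=24,tcp,nw_src=10.128.2.23,nw_dst=10.128.2.20,tcp_dst=8080'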

 

Description of problem:

 

Observation from CISv1.4 pdf:
1.1.9 Ensure that the Container Network Interface file permissions are set to 600 or more restrictive
“Container Network Interface provides various networking options for overlay networking.
You should consult their documentation and restrict their respective file permissions to maintain the integrity of those files. Those files should be writable by only the administrators on the system.”
 
To conform with the CIS benchmarks, the /var/run/multus/cni/net.d/*.conf files on nodes should be updated to 600.

$ for i in $(oc get pods -n openshift-multus -l app=multus -oname); do oc exec -n openshift-multus $i -- /bin/bash -c "stat -c \"%a %n\" /host/var/run/multus/cni/net.d/*.conf"; done
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf
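
A read-only variant of the command above that lists only the offending files (for verification; the persistent fix belongs in the component that writes these files):

# Report any multus CNI config files that are readable or writable by group/other, i.e. looser than 600.
for i in $(oc get pods -n openshift-multus -l app=multus -oname); do
  oc exec -n openshift-multus "$i" -- \
    find /host/var/run/multus/cni/net.d -name '*.conf' -perm /077 -exec stat -c '%a %n' {} \;
done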

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-07-20-215234

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

The file permissions of /var/run/multus/cni/net.d/*.conf on nodes are 644.

Expected results:

The file permissions of /var/run/multus/cni/net.d/*.conf on nodes should be updated to 600

Additional info:

 

Description of problem:

Traffic from egress IPs was interrupted after a cluster update to OpenShift 4.10.46.

A customer cluster was updated; it is an OpenShift 4.10.46 cluster with SDN.

More details about the issue are available in a private comment below, since they contain customer data.

In an OpenShift cluster with the OpenShiftSDN network plugin, with egressIP and the NMState operator configured, there are some conditions under which the egressIP is deconfigured from the network interface.

 

The bug is 100% reproducible.

Steps for reproducing the issue are:

1. Install a cluster with OpenShiftSDN network plugin.

2. Configure egressip for a project.

3. Install NMstate operator.

4. Create a NodeNetworkConfigurationPolicy.

5. Identify on which node the egressIP is present.

6. Restart the nmstate-handler pod running on the identified node.

7. Verify that the egressIP is no more present.

Restarting the sdn pod related to the identified node will reconfigure the egressIP in the node.
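
A minimal verification sketch for steps 5-7, assuming OpenShiftSDN resource names (HostSubnet) and using placeholders for the node and egress IP:

# Step 5: find which node currently hosts the egress IP (HostSubnet records egressIPs per node).
oc get hostsubnets -o custom-columns=NODE:.host,EGRESSIPS:.egressIPs
# Step 7: after restarting nmstate-handler on that node, check whether the IP is still assigned to an interface.
oc debug node/<node-name> -- chroot /host ip -4 addr show | grep '<egress-ip>'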

This issue has a high impact, since any change triggered through the NMState operator can break application traffic. For example, in the customer environment, the issue is triggered any time a new node is added to the cluster.

The expectation is that the NMState operator should not interfere with the SDN configuration.

There is a capacity limit on egressIPs for different cloud providers; for example, on GCP the limit is 10.

If the number of egressIPs added to a hostsubnet exceeds the capacity limit, a message is expected to be emitted to the event log, where it can be seen through "oc get events".

 

On a GCP cluster with the SDN plugin, egressCIDRs were configured on one worker node and 12 netnamespaces were configured, each with 1 egressIP, so the total number of egressIPs for the hostsubnet exceeded its capacity limit of 10. No event was logged to indicate that the number of egressIPs for the hostsubnet had exceeded the limit.
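
A sketch of the checks involved (the node name is a placeholder; the event query shows what the bug asks for, not what currently exists):

# Count the egress IPs assigned to the worker's HostSubnet (the GCP limit is 10 per node).
oc get hostsubnet <worker-node> -o jsonpath='{.egressIPs}'
# Look for any warning event about the exceeded capacity; today nothing is emitted.
oc get events -A | grep -i egress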

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-08-02-014045   True        False         160m    Cluster version is 4.11.0-0.nightly-2022-08-02-014045

 

See attachment for more details.

 

Description of problem:

Observation from CISv1.4 pdf:
1.1.9 Ensure that the Container Network Interface file permissions are set to 600 or more restrictive

"Container Network Interface provides various networking options for overlay networking.
You should consult their documentation and restrict their respective file permissions to maintain the integrity of those files. Those files should be writable by only the administrators on the system."
 
To conform with the CIS benchmarks, the /var/lib/cni/networks/openshift-sdn files in all sdn pods should be updated to 600.
$ for i in $(oc get pods -n openshift-sdn -l app=sdn -oname); do oc exec -n openshift-sdn $i -- find /var/lib/cni/networks/openshift-sdn -type f -exec stat -c %a {} \;; done
Defaulted container "sdn" out of: sdn, kube-rbac-proxy
644
644
644
644
644
644
644
644
644
644
644
644
644
Defaulted container "sdn" out of: sdn, kube-rbac-proxy
644
644
644
644
644
644
644
644
644
644
644
644
644
Defaulted container "sdn" out of: sdn, kube-rbac-proxy
644
644
644
644
644
644
644
644
644
644
644
644
Defaulted container "sdn" out of: sdn, kube-rbac-proxy
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
Defaulted container "sdn" out of: sdn, kube-rbac-proxy
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
Defaulted container "sdn" out of: sdn, kube-rbac-proxy
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644
644

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-07-20-215234

How reproducible:

Always

Steps to Reproduce:

1.
2.
3.

Actual results:

The file permissions for the /var/lib/cni/networks/openshift-sdn files in all sdn pods are 644.

Expected results:

The file permissions for /var/lib/cni/networks/openshift-sdn files in all sdn pods should be updated to 600

Additional info:

 

Please review the following PR: https://github.com/openshift/sdn/pull/600

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

When creating services in an OVN hybrid-overlay cluster with Windows workers, we are experiencing intermittent reachability issues for the external IP when the number of pods in the exposed deployment is greater than 1:

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get svc -n winc-38186 
NAME            TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)        AGE
win-webserver   LoadBalancer   172.30.38.192   34.136.170.199   80:30246/TCP   41m

cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get deploy -n winc-38186 
NAME            READY   UP-TO-DATE   AVAILABLE   AGE
win-webserver   6/6     6            6           42m

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get pods -n winc-38186 
NAME                             READY   STATUS    RESTARTS   AGE
win-webserver-597fb4c9cc-8ccwg   1/1     Running   0          6s
win-webserver-597fb4c9cc-f54x5   1/1     Running   0          6s
win-webserver-597fb4c9cc-jppxb   1/1     Running   0          97s
win-webserver-597fb4c9cc-twn9b   1/1     Running   0          6s
win-webserver-597fb4c9cc-x5rfr   1/1     Running   0          6s
win-webserver-597fb4c9cc-z8sfv   1/1     Running   0          6s

[cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
curl: (7) Failed to connect to 34.136.170.199 port 80: Connection timed out
[cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
curl: (7) Failed to connect to 34.136.170.199 port 80: Connection timed out

Looking at the LoadBalancer Service, we can see that externalTrafficPolicy is set to "Cluster":

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get svc -n winc-38186 win-webserver -o yaml
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2022-11-25T13:29:00Z"
  finalizers:
  - service.kubernetes.io/load-balancer-cleanup
  labels:
    app: win-webserver
  name: win-webserver
  namespace: winc-38186
  resourceVersion: "169364"
  uid: 4a229123-ee88-47b6-99ce-814522803ad8
spec:
  allocateLoadBalancerNodePorts: true
  clusterIP: 172.30.38.192
  clusterIPs:
  - 172.30.38.192
  externalTrafficPolicy: Cluster
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - nodePort: 30246
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: win-webserver
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - ip: 34.136.170.199


Recreating the Service with externalTrafficPolicy set to Local seems to solve the issue:

$ oc describe svc win-webserver -n winc-38186
Name:                     win-webserver
Namespace:                winc-38186
Labels:                   app=win-webserver
Annotations:              <none>
Selector:                 app=win-webserver
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       172.30.38.192
IPs:                      172.30.38.192
LoadBalancer Ingress:     34.136.170.199
Port:                     <unset>  80/TCP
TargetPort:               80/TCP
NodePort:                 <unset>  30246/TCP
Endpoints:                10.132.0.18:80,10.132.0.19:80,10.132.0.20:80 + 3 more...
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type    Reason                 Age                 From                Message
  ----    ------                 ----                ----                -------
  Normal  ExternalTrafficPolicy  66m                 service-controller  Cluster -> Local
  Normal  EnsuringLoadBalancer   63m (x3 over 113m)  service-controller  Ensuring load balancer
  Normal  ExternalTrafficPolicy  63m                 service-controller  Local -> Cluster
  Normal  EnsuredLoadBalancer    62m (x3 over 113m)  service-controller  Ensured load balancer 
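
For reference, the policy can also be switched in place instead of recreating the Service; a minimal sketch assuming the same namespace and Service name:

# Route external traffic only to nodes that host a backend pod for this Service.
oc patch svc win-webserver -n winc-38186 -p '{"spec":{"externalTrafficPolicy":"Local"}}'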

$ oc get svc -n winc-test
NAME              TYPE           CLUSTER-IP      EXTERNAL-IP    PORT(S)          AGE
linux-webserver   LoadBalancer   172.30.175.95   34.136.11.87   8080:30715/TCP   152m
win-check         LoadBalancer   172.30.50.151   35.194.12.34   80:31725/TCP     4m33s
win-webserver     LoadBalancer   172.30.15.95    35.226.129.1   80:30409/TCP     152m
[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34

While the other service which has externalTrafficPolicy set to "Cluster" is still failing:

[cloud-user@preserve-jfrancoa tmp]$ curl 35.226.129.1
curl: (7) Failed to connect to 35.226.129.1 port 80: Connection timed out

 

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-11-24-203151   True        False         7h2m    Cluster version is 4.12.0-0.nightly-2022-11-24-203151


$ oc get network cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2022-11-25T06:56:50Z"
  generation: 2
  name: cluster
  resourceVersion: "2952"
  uid: e9ad729c-36a4-4e71-9a24-740352b11234
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  externalIP:
    policy: {}
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
status:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  clusterNetworkMTU: 1360
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16

How reproducible:

Always. Sometimes it takes more curl calls to the external IP, but it always ends up timing out.

Steps to Reproduce:

1. Deploy a Windows cluster with OVN-Hybrid overlay on GCP, the following Jenkins job can be used for it: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/158926/
2. Create a deployment and a service, for example:
apiVersion: v1
kind: Service
metadata:
  labels:
    app: win-check
  name: win-check
  namespace: winc-test
spec:
  #externalTrafficPolicy: Local
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: win-check
  type: LoadBalancer
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: win-check
  name: win-check
  namespace: winc-test
spec:
  replicas: 6
  selector:
    matchLabels:
      app: win-check
  template:
    metadata:
      labels:
        app: win-check
      name: win-check
    spec:
      containers:
      - command:
        - pwsh.exe
        - -command
        - $listener = New-Object System.Net.HttpListener; $listener.Prefixes.Add('http://*:80/');
          $listener.Start();Write-Host('Listening at http://*:80/'); while ($listener.IsListening)
          { $context = $listener.GetContext(); $response = $context.Response; $content='<html><body><H1>Windows
          Container Web Server</H1></body></html>'; $buffer = [System.Text.Encoding]::UTF8.GetBytes($content);
          $response.ContentLength64 = $buffer.Length; $response.OutputStream.Write($buffer,
          0, $buffer.Length); $response.Close(); };
        image: mcr.microsoft.com/powershell:lts-nanoserver-ltsc2022
        name: win-check
        securityContext:
          runAsNonRoot: false
          windowsOptions:
            runAsUserName: ContainerAdministrator
      nodeSelector:
        kubernetes.io/os: windows
      tolerations:
      - key: os
        value: Windows
  3. Get the external IP for the service:
$ oc get svc -n winc-test                                                   
NAME              TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)          AGE                                            
linux-webserver   LoadBalancer   172.30.175.95   34.136.11.87     8080:30715/TCP   94m                                            
win-check         LoadBalancer   172.30.82.251   35.239.175.209   80:30530/TCP     29s                                            
win-webserver     LoadBalancer   172.30.15.95    35.226.129.1     80:30409/TCP     94m

  4. Try to curl the external-ip:
$ curl 35.239.175.209
curl: (7) Failed to connect to 35.239.175.209 port 80: Connection timed out

 

Actual results:

The load balancer IP is not reachable, which impacts service availability.

Expected results:

The Load Balancer IP is available at all times

Additional info:

 

https://github.com/openshift/cluster-network-operator/blob/830daae9472c1e3f525c0af66bc7ea4054de9989/bindata/network/openshift-sdn/sdn.yaml#L308
executes the host's `oc` binary, but in the container's userspace. This breaks when we update RHCOS to RHEL 9 but leave the SDN pods on RHEL 8.

This code would likely be better written directly in Go instead of bash.

We need to rebase openshift-sdn to kube 1.25's kube-proxy.

In particular, we need this to get https://github.com/kubernetes/kubernetes/pull/110334 into master because we will probably get asked to backport it.

Description of problem:

During a highly escalated case, we found the following scenario:
- Due to an unrelated problem, 2 control plane nodes had the "localhost.localdomain" hostname when their respective sdn-controller pods started (this problem is out of the scope of this bug report).
- As both sdn-controller pods had (and retained) the "localhost.localdomain" hostname, both of them used "localhost.localdomain" while trying to acquire and renew the controller lease in the openshift-network-controller configmap.
- This ultimately caused both sdn-controller pods to mistakenly believe that they were the active sdn-controller, so both of them were active at the same time.

Such a situation might have a number of undesired (and unknown) side effects. In our case, the result was that two nodes were allocated the same hostsubnet, disrupting pod communication between the 2 nodes and with the other nodes.

What we expect from this bug report: That the sdn-controller never tries to acquire a lease as "localhost.localdomain" during a failure scenario. The ideal solution would be to acquire the lease in a way that avoids collisions (more on this on comments), but at the very least, sdn-controller should prefer crash-looping rather than starting with a lease that can collide and wreak havoc.
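
A sketch for inspecting which identity currently holds the lease (the namespace is an assumption); configmap-based leader election stores the record in the standard client-go annotation:

# Print the leader-election record; holderIdentity should never be "localhost.localdomain".
oc -n openshift-sdn get configmap openshift-network-controller \
  -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'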

Version-Release number of selected component (if applicable):

Found on 4.11, but it should be reproducible in 4.13 as well.

How reproducible:

Under some error scenarios where 2 control plane nodes temporarily have "localhost.localdomain" hostname by mistake.

Steps to Reproduce:

1. Start sdn-controller pods
2.
3.

Actual results:

2 sdn-controller pods acquire the lease with "localhost.localdomain" holderIdentity and become active at the same time.

Expected results:

No sdn-controller pod to acquire the lease with "localhost.localdomain" holderIdentity. Either use unique identities even when there is failure scenario or just crash-loop.

Additional info:

Just FYI, the trigger that caused the wrong domain was investigated at this other bug: https://issues.redhat.com/browse/OCPBUGS-11997

However, this situation may happen under other possible failure scenarios, so it is worth preventing it somehow.

Please review the following PR: https://github.com/openshift/kube-rbac-proxy/pull/72

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/sdn/pull/596

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/sdn/pull/623

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

If the user specifies a DNS name in an EgressNetworkPolicy for which the upstream server returns a truncated DNS response, openshift-sdn does not fall back to TCP as expected but just takes this as a failure.
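
A sketch for confirming that the upstream server truncates over UDP but answers over TCP (dig's +ignore flag suppresses dig's own TCP retry so the tc flag stays visible; the domain and server are placeholders):

# UDP query: look for "tc" in the flags line of the header, indicating a truncated response.
dig +ignore +noall +comments <domain-from-egressnetworkpolicy> @<upstream-dns-server>
# The same query over TCP should return the full answer.
dig +tcp +noall +answer <domain-from-egressnetworkpolicy> @<upstream-dns-server>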

Version-Release number of selected component (if applicable):

4.11 (originally reproduced on 4.9)

How reproducible:

Always

Steps to Reproduce:

1. Setup an EgressNetworkPolicy that points to a domain where a truncated response is returned while querying via UDP.
2.
3.

Actual results:

Error, DNS resolution not completed.

Expected results:

Request retried via TCP and succeeded.

Additional info:

In comments.

Description of problem:

IBM ROKS uses Calico as its CNI. In previous versions of OpenShift, OpenShiftSDN would create iptables rules that force a local endpoint for the DNS Service.

Starting in OCP 4.17, with the removal of SDN, IBM ROKS is not using OVN-K, and therefore the local endpoint for the DNS service is not working as expected.

IBM ROKS is asking that this code block be restored to bring back the functionality previously seen in OCP 4.16:

https://github.com/openshift/sdn/blob/release-4.16/vendor/k8s.io/kubernetes/pkg/proxy/iptables/proxier.go#L979-L992

Without this functionality, IBM ROKS is not able to GA OCP 4.17.

Description of problem:

DNS local endpoint preference is not working for TCP DNS requests in OpenShift SDN.

Reference code: https://github.com/openshift/sdn/blob/b58a257b896d774e0a092612be250fb9414af5ca/vendor/k8s.io/kubernetes/pkg/proxy/iptables/proxier.go#L999-L1012

This is where the DNS request is short-circuited to the local DNS endpoint if it exists. This is important because DNS local preference protects against another outstanding bug, in which daemonset pods go stale for a few seconds upon node shutdown (see https://issues.redhat.com/browse/OCPNODE-549 for the fix for graceful node shutdown). This appears to be contributing to DNS issues in our internal CI clusters. https://lookerstudio.google.com/reporting/3a9d4e62-620a-47b9-a724-a5ebefc06658/page/MQwFD?s=kPTlddLa2AQ shows large numbers of "dns_tcp_lookup" failures, which I attribute to this bug.

UDP DNS local preference is working fine in OpenShift SDN. Both UDP and TCP local preference work fine in OVN. It's just TCP DNS local preference that is not working in OpenShift SDN.

Version-Release number of selected component (if applicable):

4.13, 4.12, 4.11

How reproducible:

100%

Steps to Reproduce:

1. oc debug -n openshift-dns
2. dig +short +tcp +vc +noall +answer CH TXT hostname.bind
# Retry multiple times, and you should always get the same local DNS pod.

Actual results:

[gspence@gspence origin]$ oc debug -n openshift-dns
Starting pod/image-debug ...
Pod IP: 10.128.2.10
If you don't see a command prompt, try pressing enter.
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-gzlhm"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-dnbsp"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-gzlhm"

Expected results:

[gspence@gspence origin]$ oc debug -n openshift-dns
Starting pod/image-debug ...
Pod IP: 10.128.2.10
If you don't see a command prompt, try pressing enter.
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8" 

Additional info:

https://issues.redhat.com/browse/OCPBUGS-488 is the previous bug I opened for UDP DNS local preference not working.

iptables-save from a vanilla 4.13 cluster-bot AWS/SDN cluster: https://drive.google.com/file/d/1jY8_f64nDWi5SYT45lFMthE0vhioYIfe/view?usp=sharing