Jump to: Complete Features | Complete Epics | Other Complete
Note: this page shows the Feature-Based Change Log for a release.
These features were completed when this image was assembled.
Make sure we deliver a 1.30 kube-proxy standalone image.
This Epic tracks the rebase we need to do when Kubernetes 1.26 is GA: https://www.kubernetes.dev/resources/release/
Rebase help: https://docs.google.com/document/d/1h1XsEt1Iug-W9JRheQas7YRsUJ_NQ8ghEMVmOZ4X-0s/edit
Rebase OpenShift SDN to use Kube 1.26.
Migrate every occurrence of iptables in OpenShift to use nftables instead.
Implement a full migration from iptables to nftables within a series of "normal" upgrades of OpenShift with the goal of not causing any more network disruption than would normally be required for an OpenShift upgrade. (Different components may migrate from iptables to nftables in different releases; no coordination is needed between unrelated components.)
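To make the scale of the change concrete, here is a hypothetical illustration (not taken from any actual OpenShift component) of the kind of one-to-one translation each component will have to perform, shown for a masquerade rule:

# Legacy iptables rule (assumed example, not an actual OpenShift rule):
iptables -t nat -A POSTROUTING -s 10.128.0.0/14 -j MASQUERADE
# Native nftables equivalent; assumes the table/chain were created first, e.g.:
#   nft add table ip nat
#   nft 'add chain ip nat postrouting { type nat hook postrouting priority 100 ; }'
nft add rule ip nat postrouting ip saddr 10.128.0.0/14 masquerade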
This should be the last SDN Kube rebase, but we need to work with the Windows team to find a way for them to get the latest kube-proxy without depending on this rebase, as SDN is deprecated.
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled.
This Epic tracks the rebase we need to do for Kubernetes 1.27, which is already out.
Rebase help: https://docs.google.com/document/d/1h1XsEt1Iug-W9JRheQas7YRsUJ_NQ8ghEMVmOZ4X-0s/edit
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled.
Please review the following PR: https://github.com/openshift/sdn/pull/574
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates that the image(s) being used downstream for production builds are not consistent with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to be reopened automatically.
Please review the following PR: https://github.com/openshift/sdn/pull/599
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates that the image(s) being used downstream for production builds are not consistent with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to be reopened automatically.
Description of problem:
The sdn image inherits from the cli image to get the oc binary. Change this to install the openshift-clients rpm instead.
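A minimal sketch of the suggested change (a hypothetical Dockerfile fragment; the actual base images and build stanzas differ):

# Before: inheriting/copying oc from the cli image, e.g.
#   COPY --from=cli /usr/bin/oc /usr/bin/oc
# After: install the client rpm directly in the sdn image
RUN yum install -y openshift-clients && yum clean all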
Description of problem:
An intra-namespace allow network policy doesn't work after applying an ingress & egress deny-all network policy.
Version-Release number of selected component (if applicable):
OpenShift 4.10.12
How reproducible:
Always
Steps to Reproduce:
1. Define a deny-all network policy for egress and ingress in a namespace:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
2. Define the following network policy to allow the traffic between the pods in the namespace:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-intra-namespace-001
spec:
  egress:
  - to:
    - podSelector: {}
  ingress:
  - from:
    - podSelector: {}
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
3. Test the connectivity between two pods from the namespace.
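For example, a minimal connectivity check (the pod name and target IP/port are hypothetical; any pod-to-pod connection in the namespace will do):

$ oc rsh test-pod-a
sh-4.4# curl --connect-timeout 5 http://10.128.2.23:8080/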
Actual results:
The connectivity is not allowed
Expected results:
The connectivity should be allowed between pods from the same namespace.
Additional info:
After performing a test and analyzing SDN flows for the namespace:
sh-4.4# ovs-ofctl dump-flows -O OpenFlow13 br0 | grep --color 0x964376
cookie=0x0, duration=99375.342s, table=20, n_packets=14, n_bytes=588, priority=100,arp,in_port=21,arp_spa=10.128.2.20,arp_sha=00:00:0a:80:02:14/00:00:ff:ff:ff:ff actions=load:0x964376->NXM_NX_REG0[],goto_table:30
cookie=0x0, duration=1681.845s, table=20, n_packets=11, n_bytes=462, priority=100,arp,in_port=24,arp_spa=10.128.2.23,arp_sha=00:00:0a:80:02:17/00:00:ff:ff:ff:ff actions=load:0x964376->NXM_NX_REG0[],goto_table:30
cookie=0x0, duration=99375.342s, table=20, n_packets=135610, n_bytes=759239814, priority=100,ip,in_port=21,nw_src=10.128.2.20 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
cookie=0x0, duration=1681.845s, table=20, n_packets=2006, n_bytes=12684967, priority=100,ip,in_port=24,nw_src=10.128.2.23 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
cookie=0x0, duration=99375.342s, table=25, n_packets=0, n_bytes=0, priority=100,ip,nw_src=10.128.2.20 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
cookie=0x0, duration=1681.845s, table=25, n_packets=0, n_bytes=0, priority=100,ip,nw_src=10.128.2.23 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
cookie=0x0, duration=975.129s, table=27, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=goto_table:30
cookie=0x0, duration=99375.342s, table=70, n_packets=145260, n_bytes=11722173, priority=100,ip,nw_dst=10.128.2.20 actions=load:0x964376->NXM_NX_REG1[],load:0x15->NXM_NX_REG2[],goto_table:80
cookie=0x0, duration=1681.845s, table=70, n_packets=2336, n_bytes=191079, priority=100,ip,nw_dst=10.128.2.23 actions=load:0x964376->NXM_NX_REG1[],load:0x18->NXM_NX_REG2[],goto_table:80
cookie=0x0, duration=975.129s, table=80, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=output:NXM_NX_REG2[]
We see that the following rule doesn't match because `reg1` hasn't been defined:
cookie=0x0, duration=975.129s, table=27, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=goto_table:30
The ovnver and ovsver args should also be used to infer the short versions of the RPMs to install in the sdn container images.
Description of problem:
Observation from the CIS v1.4 pdf:
1.1.9 Ensure that the Container Network Interface file permissions are set to 600 or more restrictive.
"Container Network Interface provides various networking options for overlay networking. You should consult their documentation and restrict their respective file permissions to maintain the integrity of those files. Those files should be writable by only the administrators on the system."
To conform with the CIS benchmark, the /var/run/multus/cni/net.d/*.conf files on nodes should be updated to 600.
$ for i in $(oc get pods -n openshift-multus -l app=multus -oname); do oc exec -n openshift-multus $i -- /bin/bash -c "stat -c \"%a %n\" /host/var/run/multus/cni/net.d/*.conf"; done
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-07-20-215234
Actual results:
The file permissions of /var/run/multus/cni/net.d/*.conf on nodes are 644.
Expected results:
The file permissions of /var/run/multus/cni/net.d/*.conf on nodes should be updated to 600.
Description of problem:
Traffic from egress IPs was interrupted after the cluster was patched to OpenShift 4.10.46.
A customer cluster was patched; it is an OpenShift 4.10.46 cluster with SDN.
More detail about the issue is available in a private comment below, since it contains customer data.
In an OpenShift cluster with the OpenShiftSDN network plugin, with egressIP and the NMState operator configured, there are conditions under which the egressIP gets deconfigured from the network interface.
The bug is 100% reproducible.
Steps for reproducing the issue are:
1. Install a cluster with OpenShiftSDN network plugin.
2. Configure egressip for a project.
3. Install NMstate operator.
4. Create a NodeNetworkConfigurationPolicy (a minimal example is sketched after these steps).
5. Identify on which node the egressIP is present.
6. Restart the nmstate-handler pod running on the identified node.
7. Verify that the egressIP is no longer present.
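A minimal NodeNetworkConfigurationPolicy for step 4 (a hedged sketch: the exact apiVersion depends on the installed NMState operator version, and any policy that makes nmstate-handler reapply node network state should work):

oc apply -f - <<'EOF'
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: dummy-iface-policy
spec:
  desiredState:
    interfaces:
    - name: dummy0
      type: dummy
      state: up
EOF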
Restarting the sdn pod related to the identified node will reconfigure the egressIP in the node.
This issue has a high impact since any changes triggered for the NMstate operator will prevent application traffic. For example, in the customer environment, the issue is triggered any time a new node is added to the cluster.
The expectation is that the NMState operator should not interfere with the SDN configuration.
There is a capacity limit on egressIPs that varies by cloud provider; on GCP, for example, the limit is 10.
If the number of egressIPs added to a hostsubnet exceeds this capacity limit, it is expected that a message is emitted to the event log, which can be seen through "oc get events".
On a GCP cluster with the SDN plugin, we configured egressCIDRs on one worker node and 12 netnamespaces, each with 1 egressIP, so the total number of egressIPs for the hostsubnet exceeded its capacity limit of 10. No event was emitted to indicate that the number of egressIPs for the hostsubnet had exceeded the limit. (A reproduction sketch follows.)
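A sketch of the reproduction (node and project names are hypothetical; egressCIDRs and egressIPs are the real HostSubnet and NetNamespace fields):

# Let egress IPs for the node be allocated from a CIDR
oc patch hostsubnet <worker-node> --type=merge -p '{"egressCIDRs": ["10.0.128.0/24"]}'
# Assign one egress IP to each of 12 projects (12 > the GCP limit of 10)
for i in $(seq 1 12); do
  oc patch netnamespace project$i --type=merge -p "{\"egressIPs\": [\"10.0.128.$((10 + i))\"]}"
done
# Expected (but missing): a warning event about exceeding the capacity limit
oc get events -A | grep -i egress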
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-08-02-014045 True False 160m Cluster version is 4.11.0-0.nightly-2022-08-02-014045
See attachment for more details.
Description of problem:
Observation from the CIS v1.4 pdf:
1.1.9 Ensure that the Container Network Interface file permissions are set to 600 or more restrictive.
"Container Network Interface provides various networking options for overlay networking. You should consult their documentation and restrict their respective file permissions to maintain the integrity of those files. Those files should be writable by only the administrators on the system."
To conform with the CIS benchmark, the /var/lib/cni/networks/openshift-sdn files in all sdn pods should be updated to 600.
$ for i in $(oc get pods -n openshift-sdn -l app=sdn -oname); do oc exec -n openshift-sdn $i -- find /var/lib/cni/networks/openshift-sdn -type f -exec stat -c %a {} \;; done
Defaulted container "sdn" out of: sdn, kube-rbac-proxy
644 (printed once per file; every file in every one of the six sdn pods reports 644)
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-07-20-215234
How reproducible:
Always
Actual results:
The file permissions for the /var/lib/cni/networks/openshift-sdn files in all sdn pods are 644.
Expected results:
The file permissions for the /var/lib/cni/networks/openshift-sdn files in all sdn pods should be updated to 600.
Please review the following PR: https://github.com/openshift/sdn/pull/600
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates that the image(s) being used downstream for production builds are not consistent with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to be reopened automatically.
Description of problem:
When creating services in an OVN hybrid-overlay cluster with Windows workers, we are experiencing intermittent reachability issues for the external IP when the number of pods in the exposed deployment is bigger than 1:

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get svc -n winc-38186
NAME            TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)        AGE
win-webserver   LoadBalancer   172.30.38.192   34.136.170.199   80:30246/TCP   41m

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get deploy -n winc-38186
NAME            READY   UP-TO-DATE   AVAILABLE   AGE
win-webserver   6/6     6            6           42m

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get pods -n winc-38186
NAME                             READY   STATUS    RESTARTS   AGE
win-webserver-597fb4c9cc-8ccwg   1/1     Running   0          6s
win-webserver-597fb4c9cc-f54x5   1/1     Running   0          6s
win-webserver-597fb4c9cc-jppxb   1/1     Running   0          97s
win-webserver-597fb4c9cc-twn9b   1/1     Running   0          6s
win-webserver-597fb4c9cc-x5rfr   1/1     Running   0          6s
win-webserver-597fb4c9cc-z8sfv   1/1     Running   0          6s

[cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
<html><body><H1>Windows Container Web Server</H1></body></html>
[cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
<html><body><H1>Windows Container Web Server</H1></body></html>
[cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
curl: (7) Failed to connect to 34.136.170.199 port 80: Connection timed out
[cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
<html><body><H1>Windows Container Web Server</H1></body></html>
[cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
<html><body><H1>Windows Container Web Server</H1></body></html>
[cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
<html><body><H1>Windows Container Web Server</H1></body></html>
[cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
curl: (7) Failed to connect to 34.136.170.199 port 80: Connection timed out

When having a look at the LoadBalancer service, we can see that the externalTrafficPolicy is "Cluster":

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get svc -n winc-38186 win-webserver -o yaml
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2022-11-25T13:29:00Z"
  finalizers:
  - service.kubernetes.io/load-balancer-cleanup
  labels:
    app: win-webserver
  name: win-webserver
  namespace: winc-38186
  resourceVersion: "169364"
  uid: 4a229123-ee88-47b6-99ce-814522803ad8
spec:
  allocateLoadBalancerNodePorts: true
  clusterIP: 172.30.38.192
  clusterIPs:
  - 172.30.38.192
  externalTrafficPolicy: Cluster
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - nodePort: 30246
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: win-webserver
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - ip: 34.136.170.199

Recreating the service with externalTrafficPolicy set to Local seems to solve the issue:

$ oc describe svc win-webserver -n winc-38186
Name:                     win-webserver
Namespace:                winc-38186
Labels:                   app=win-webserver
Annotations:              <none>
Selector:                 app=win-webserver
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       172.30.38.192
IPs:                      172.30.38.192
LoadBalancer Ingress:     34.136.170.199
Port:                     <unset>  80/TCP
TargetPort:               80/TCP
NodePort:                 <unset>  30246/TCP
Endpoints:                10.132.0.18:80,10.132.0.19:80,10.132.0.20:80 + 3 more...
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type    Reason                 Age                 From                Message
  ----    ------                 ----                ----                -------
  Normal  ExternalTrafficPolicy  66m                 service-controller  Cluster -> Local
  Normal  EnsuringLoadBalancer   63m (x3 over 113m)  service-controller  Ensuring load balancer
  Normal  ExternalTrafficPolicy  63m                 service-controller  Local -> Cluster
  Normal  EnsuredLoadBalancer    62m (x3 over 113m)  service-controller  Ensured load balancer

$ oc get svc -n winc-test
NAME              TYPE           CLUSTER-IP      EXTERNAL-IP    PORT(S)          AGE
linux-webserver   LoadBalancer   172.30.175.95   34.136.11.87   8080:30715/TCP   152m
win-check         LoadBalancer   172.30.50.151   35.194.12.34   80:31725/TCP     4m33s
win-webserver     LoadBalancer   172.30.15.95    35.226.129.1   80:30409/TCP     152m

[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>
[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>
[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>
(the curl against 35.194.12.34 was repeated over a dozen more times, succeeding every time)

While the other service, which has externalTrafficPolicy set to "Cluster", is still failing:

[cloud-user@preserve-jfrancoa tmp]$ curl 35.226.129.1
curl: (7) Failed to connect to 35.226.129.1 port 80: Connection timed out
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-11-24-203151   True        False         7h2m    Cluster version is 4.12.0-0.nightly-2022-11-24-203151

$ oc get network cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2022-11-25T06:56:50Z"
  generation: 2
  name: cluster
  resourceVersion: "2952"
  uid: e9ad729c-36a4-4e71-9a24-740352b11234
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  externalIP:
    policy: {}
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
status:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  clusterNetworkMTU: 1360
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
How reproducible:
Always; sometimes it takes more curl calls to the external IP, but it always ends up timing out.
Steps to Reproduce:
1. Deploy a Windows cluster with OVN hybrid overlay on GCP; the following Jenkins job can be used for it: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/158926/
2. Create a deployment and a service, for example:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: win-check
  name: win-check
  namespace: winc-test
spec:
  #externalTrafficPolicy: Local
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: win-check
  type: LoadBalancer
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: win-check
  name: win-check
  namespace: winc-test
spec:
  replicas: 6
  selector:
    matchLabels:
      app: win-check
  template:
    metadata:
      labels:
        app: win-check
      name: win-check
    spec:
      containers:
      - command:
        - pwsh.exe
        - -command
        - $listener = New-Object System.Net.HttpListener; $listener.Prefixes.Add('http://*:80/'); $listener.Start();Write-Host('Listening at http://*:80/'); while ($listener.IsListening) { $context = $listener.GetContext(); $response = $context.Response; $content='<html><body><H1>Windows Container Web Server</H1></body></html>'; $buffer = [System.Text.Encoding]::UTF8.GetBytes($content); $response.ContentLength64 = $buffer.Length; $response.OutputStream.Write($buffer, 0, $buffer.Length); $response.Close(); };
        image: mcr.microsoft.com/powershell:lts-nanoserver-ltsc2022
        name: win-check
        securityContext:
          runAsNonRoot: false
          windowsOptions:
            runAsUserName: ContainerAdministrator
      nodeSelector:
        kubernetes.io/os: windows
      tolerations:
      - key: os
        value: Windows

3. Get the external IP for the service:

$ oc get svc -n winc-test
NAME              TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)          AGE
linux-webserver   LoadBalancer   172.30.175.95   34.136.11.87     8080:30715/TCP   94m
win-check         LoadBalancer   172.30.82.251   35.239.175.209   80:30530/TCP     29s
win-webserver     LoadBalancer   172.30.15.95    35.226.129.1     80:30409/TCP     94m

4. Try to curl the external IP:

$ curl 35.239.175.209
curl: (7) Failed to connect to 35.239.175.209 port 80: Connection timed out
Actual results:
The load balancer IP is not reachable, impacting service availability.
Expected results:
The load balancer IP should be available at all times.
Additional info:
https://github.com/openshift/cluster-network-operator/blob/830daae9472c1e3f525c0af66bc7ea4054de9989/bindata/network/openshift-sdn/sdn.yaml#L308
is executing the host's `oc` binary, but in the container's userspace. This breaks when we update RHCOS to RHEL 9 but leave the SDN pods on RHEL 8.
This code would likely be better written directly in Go instead of bash.
We need to rebase openshift-sdn to kube 1.25's kube-proxy.
In particular, we need this to get https://github.com/kubernetes/kubernetes/pull/110334 into master because we will probably get asked to backport it.
Description of problem:
During a highly escalated scenario, we found the following:
- Due to an unrelated problem, 2 control plane nodes had the "localhost.localdomain" hostname when their respective sdn-controller pods started (that problem is out of the scope of this bug report).
- As both sdn-controller pods had (and retained) the "localhost.localdomain" hostname, both of them used "localhost.localdomain" while trying to acquire and renew the controller lease in the openshift-network-controller configmap.
- This ultimately caused both sdn-controller pods to mistakenly believe that they were the active sdn-controller, so both of them were active at the same time.
Such a situation might have a number of undesired (and unknown) side effects. In our case, two nodes were allocated the same hostsubnet, disrupting pod communication between those 2 nodes and with the other nodes.
What we expect from this bug report: that sdn-controller never tries to acquire a lease as "localhost.localdomain" during a failure scenario. The ideal solution would be to acquire the lease in a way that avoids collisions (more on this in the comments), but at the very least, sdn-controller should prefer crash-looping over starting with a lease identity that can collide and wreak havoc.
Version-Release number of selected component (if applicable):
Found on 4.11, but it should be reproducible in 4.13 as well.
How reproducible:
Under some error scenarios where 2 control plane nodes temporarily have "localhost.localdomain" hostname by mistake.
Steps to Reproduce:
1. Start sdn-controller pods
Actual results:
2 sdn-controller pods acquire the lease with "localhost.localdomain" holderIdentity and become active at the same time.
Expected results:
No sdn-controller pod should acquire the lease with a "localhost.localdomain" holderIdentity: either use unique identities even in failure scenarios, or just crash-loop.
Additional info:
Just FYI, the trigger that caused the wrong domain was investigated in this other bug: https://issues.redhat.com/browse/OCPBUGS-11997. However, this situation may happen under other failure scenarios, so it is worth preventing it somehow.
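For reference, a hedged way to inspect the current lease holder (assuming the configmap lives in the openshift-sdn namespace and uses the standard configmap-based leader-election annotation):

oc -n openshift-sdn get configmap openshift-network-controller -o yaml | grep holderIdentity
# the holderIdentity in the leader annotation should never be "localhost.localdomain"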
Please review the following PR: https://github.com/openshift/kube-rbac-proxy/pull/72
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates that the image(s) being used downstream for production builds are not consistent with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to be reopened automatically.
Please review the following PR: https://github.com/openshift/sdn/pull/596
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates that the image(s) being used downstream for production builds are not consistent with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to be reopened automatically.
Please review the following PR: https://github.com/openshift/sdn/pull/623
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates that the image(s) being used downstream for production builds are not consistent with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to be reopened automatically.
Description of problem:
If the user specifies a DNS name in an EgressNetworkPolicy for which the upstream server returns a truncated DNS response, openshift-sdn does not fall back to TCP as expected, but just treats it as a failure.
Version-Release number of selected component (if applicable):
4.11 (originally reproduced on 4.9)
How reproducible:
Always
Steps to Reproduce:
1. Set up an EgressNetworkPolicy that points to a domain for which a truncated response is returned when querying via UDP (a minimal example follows).
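A minimal policy for step 1 (a sketch; example.com stands in for a domain whose UDP answers are truncated, and <project> is a placeholder namespace):

oc apply -n <project> -f - <<'EOF'
apiVersion: network.openshift.io/v1
kind: EgressNetworkPolicy
metadata:
  name: truncated-dns-test
spec:
  egress:
  - type: Allow
    to:
      dnsName: example.com
  - type: Deny
    to:
      cidrSelector: 0.0.0.0/0
EOF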
Actual results:
Error, DNS resolution not completed.
Expected results:
The request is retried via TCP and succeeds.
Additional info:
In comments.
Description of problem:
IBM ROKS uses Calico as its CNI. In previous versions of OpenShift, OpenShiftSDN would create iptables rules that force the use of a local endpoint for the DNS service.
Starting in OCP 4.17, with the removal of SDN, IBM ROKS is not using OVN-K, and therefore the local endpoint for the DNS service is not working as expected.
IBM ROKS is asking that the code block be restored, restoring the functionality previously seen in OCP 4.16.
Without this functionality, IBM ROKS is not able to GA OCP 4.17.
Description of problem:
DNS local endpoint preference is not working for TCP DNS requests with OpenShift SDN.
Reference code: https://github.com/openshift/sdn/blob/b58a257b896d774e0a092612be250fb9414af5ca/vendor/k8s.io/kubernetes/pkg/proxy/iptables/proxier.go#L999-L1012
This is where a DNS request is short-circuited to the local DNS endpoint if one exists. This is important because DNS local preference protects against another outstanding bug, in which daemonset pods go stale for a few seconds upon node shutdown (see https://issues.redhat.com/browse/OCPNODE-549 for the graceful node shutdown fix). This appears to be contributing to DNS issues in our internal CI clusters: https://lookerstudio.google.com/reporting/3a9d4e62-620a-47b9-a724-a5ebefc06658/page/MQwFD?s=kPTlddLa2AQ shows large numbers of "dns_tcp_lookup" failures, which I attribute to this bug.
UDP DNS local preference is working fine in OpenShift SDN, and both UDP and TCP local preference work fine in OVN. It is only TCP DNS local preference that is not working in OpenShift SDN.
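Conceptually, the short-circuit in that code yields iptables rules shaped like the following (an illustrative sketch with made-up chain suffixes, not actual proxier output):

# Without local preference: the service chain load-balances across all endpoints
-A KUBE-SVC-DNS -m statistic --mode random --probability 0.33333 -j KUBE-SEP-AAA
-A KUBE-SVC-DNS -m statistic --mode random --probability 0.50000 -j KUBE-SEP-BBB
-A KUBE-SVC-DNS -j KUBE-SEP-CCC
# With local preference: DNS traffic originating on the node jumps straight to its local endpoint
-A KUBE-SVC-DNS -j KUBE-SEP-LOCAL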
Version-Release number of selected component (if applicable):
4.13, 4.12, 4.11
How reproducible:
100%
Steps to Reproduce:
1. oc debug -n openshift-dns
2. dig +short +tcp +vc +noall +answer CH TXT hostname.bind
   # Retry multiple times; you should always get the same local DNS pod.
Actual results:
[gspence@gspence origin]$ oc debug -n openshift-dns
Starting pod/image-debug ...
Pod IP: 10.128.2.10
If you don't see a command prompt, try pressing enter.
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-gzlhm"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-dnbsp"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-gzlhm"
Expected results:
[gspence@gspence origin]$ oc debug -n openshift-dns
Starting pod/image-debug ...
Pod IP: 10.128.2.10
If you don't see a command prompt, try pressing enter.
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8"
Additional info:
https://issues.redhat.com/browse/OCPBUGS-488 is the previous bug I opened for UDP DNS local preference not working. iptables-save output from a vanilla 4.13 cluster-bot AWS SDN cluster: https://drive.google.com/file/d/1jY8_f64nDWi5SYT45lFMthE0vhioYIfe/view?usp=sharing