Documentation for troubleshooting and debugging known issues in Shuffle.
Please check the Debugging section in the Configuration documentation
MFA can be enabled for your account on the settings User page of an organization, or on your own settings page. If you have lost access to your account because of MFA, follow these steps:
Cloud (shuffler.io): Send an email to support@shuffler.io from the email address you want MFA removed for.
Onprem: This is a bit more involved, as the OpenSearch database needs to be modified. Here is how:
Find the user ID of your account in the backend logs. IDs are in the UUID format 550e8400-e29b-41d4-a716-446655440000.
docker logs shuffle-backend
docker exec -it shuffle-opensearch bash
curl -k -u admin:admin -H 'Content-Type: application/json' 'https://localhost:9200/users/_update/USERID' -d '{"doc": {"mfa_info.active": false, "mfa_info.active_code": "", "mfa_info.previous_code": ""}}'
Restart the backend server. This fixes potential caching problems:
docker restart shuffle-backend
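The update call above can also be scripted. The following is a minimal sketch, assuming the `users` index and the `mfa_info.*` fields shown in the curl command. The helper name `build_mfa_reset_request` is hypothetical, and the function only builds the request so that it can be inspected before anything is sent to OpenSearch.

```python
import json
import urllib.request


def build_mfa_reset_request(user_id, base_url="https://localhost:9200"):
    """Build (but do not send) the OpenSearch request that clears MFA for a user."""
    body = {
        "doc": {
            "mfa_info.active": False,
            "mfa_info.active_code": "",
            "mfa_info.previous_code": "",
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/users/_update/{user_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # curl also uses POST when -d is given
    )


if __name__ == "__main__":
    # "USERID" is the same placeholder as in the curl command.
    req = build_mfa_reset_request("USERID")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Sending the request would still need the OpenSearch credentials and, for a self-signed certificate, a relaxed TLS context, which is why the sketch stops at building it.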
Due to the nature of Shuffle at scale, there are bound to be network issues. As Shuffle runs in Docker, and sometimes in swarm with k8s networking, it complicates the matter even further. Here's a list of things to help with debugging networking. If all else fails, reboot the machine and Docker.
sysctl -p
This allows network cards to talk to each other on the same machine.
{
  "dns": ["10.0.0.2", "8.8.8.8"]
}
"Fixes" (in order):
In certain cases there may be DNS issues, leading to hanging executions. This is in most cases due to apps not being able to find the backend in some way. That's why the best solution if possible is to use the IP as hostname for Orborus -> Backend communication.
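To check whether a DNS problem is the cause, it can help to test name resolution from the machine or container in question. This is a minimal sketch using only the Python standard library. The function name `resolve_host` is hypothetical, and `shuffle-backend` is assumed to be the hostname Orborus uses to reach the backend, so adjust it to your setup.

```python
import socket


def resolve_host(hostname):
    """Return the IP addresses a hostname resolves to, or an empty list on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    for name in ("shuffle-backend", "localhost"):
        addresses = resolve_host(name)
        print(name, "->", addresses if addresses else "NOT RESOLVABLE")
```

If the backend hostname does not resolve here, using the backend IP as the hostname, as recommended above, sidesteps the lookup entirely.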
In certain cases, you may have an issue loading apps into Shuffle. If this is the case, it most likely means you have proxy issues, and can't reach github.com, where our apps are hosted.
Here's how to manually load them into Shuffle using git.
#1. If a proxy is required for your environment: Set up the proxy for Git (install if you don't have it).
git config --global http.proxy http://proxy.mycompany:80
#2. Go to the shuffle folder where you have Shuffle installed, then go to the shuffle-apps folder (./shuffle/shuffle-apps)
git clone https://github.com/shuffle/python-apps
#3. Go to the UI and hotload the apps: https://shuffler.io/docs/app_creation#hotloading_your_app (click the hotload button in the top left in the /apps UI)
Alternatively: You can download the latest Shuffle apps in your browser, and manually extract the .zip file into the ./shuffle/shuffle-apps folder.
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
869c99231ed0 opensearchproject/opensearch:latest "./opensearch-docker…" 5 weeks ago Up 2 days 9300/tcp, 9600/tcp, 0.0.0.0:9200->9200/tcp, 9650/tcp shuffle-opensearch
docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' 869c99231ed0
or
docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' shuffle-opensearch
Output:172.21.0.4
curl -XDELETE http://172.21.0.4:9200/workflowqueue-shuffle
Output:{"acknowledged":true}
The following Python script stops all running executions of a workflow in bulk.
import sys
import requests

API_ENDPOINT = "http(s)://<shuffle_endpoint>/api/v1"
API_KEY = "<your_api_key>"
WORKFLOW_NAME = "<workflow_name>"


def main():
    headers = {
        "Authorization": "Bearer " + API_KEY
    }

    # Find the workflow by name
    with requests.get(API_ENDPOINT + "/workflows", headers=headers) as response:
        response.raise_for_status()
        data = response.json()

    wflows = list(filter(lambda wf: wf.get("name") == WORKFLOW_NAME, data))
    if len(wflows) == 0:
        print("Workflow not found")
        return 2
    if len(wflows) != 1:
        print("More than one workflow has this name")
        return 1
    wflow = wflows[0]

    # Get executions
    # example: http(s)://<shuffle_endpoint>/api/v1/workflows/b519c8f7-e9b0-4b47-b93d-cac013e4522f/executions
    wflow_id = wflow.get("id")
    url = "{}/workflows/{}/executions".format(API_ENDPOINT, wflow_id)
    with requests.get(url, headers=headers) as response:
        response.raise_for_status()
        data = response.json()

    # Abort every execution that is still running
    still_running = list(filter(lambda ex: ex.get("status") == "EXECUTING", data))
    for execution in still_running:
        exec_id = execution.get("execution_id")
        print("[INFO] We're going to abort execution with ID {}".format(exec_id))
        url = "{}/workflows/{}/executions/{}/abort".format(API_ENDPOINT, wflow_id, exec_id)
        with requests.get(url, headers=headers) as response:
            response.raise_for_status()
            data = response.json()
        if data.get("success"):
            print("[INFO] Execution successfully aborted")
        else:
            print("[ERROR] Unable to abort execution")
    return 0


if __name__ == "__main__":
    sys.exit(main())
Copy the script into a file called abort_running_executions.py and run it with the command below.
**Use python or python3 depending on your environment**
python abort_running_executions.py
The requests Python library must be installed in your Python environment for the script to work.
Set the ownership of the shuffle-database folder to what the shuffle-opensearch container expects.
sudo chown 1000:1000 -R shuffle-database
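Before or after running the command, the current ownership can be checked programmatically. This is a minimal sketch, assuming a Unix host and the 1000:1000 owner used in the chown command above. The function name `has_owner` is hypothetical.

```python
import os


def has_owner(path, uid=1000, gid=1000):
    """Return True if path is owned by the given user ID and group ID."""
    st = os.stat(path)
    return st.st_uid == uid and st.st_gid == gid


if __name__ == "__main__":
    folder = "shuffle-database"  # same folder name as in the chown command
    if os.path.isdir(folder):
        print(folder, "owned by 1000:1000:", has_owner(folder))
    else:
        print(folder, "not found in the current directory")
```

Note that this only checks the top-level folder, while `chown -R` also changes everything inside it.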
Make sure you know what you're doing before you delete a user! It can have unpredictable consequences.
Find the ID of the user (replace the <USERNAME> placeholder in the query with the username of your user):
curl -X GET "https://localhost:9200/users/_search?pretty" -H 'Content-Type: application/json' -d' { "query": { "match": {"username": "<USERNAME>"} } }' --insecure -u <opensearch_user>:<opensearch_password>
Find and copy the "_id" value of your user from the returned result:
...
"_id" : "<user_id>",
...
Delete the user
curl -X DELETE "https://localhost:9200/users/_doc/<user_id>?pretty" -H 'Content-Type: application/json' --insecure -u <opensearch_user>:<opensearch_password>
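Because deleting the wrong document is hard to undo, it can be worth extracting the "_id" from the saved search response instead of copying it by eye. The sketch below assumes the standard OpenSearch search response layout (`hits.hits[]` with `_id` and `_source`) and the `username` field used in the query above. The function name `find_user_ids` and the sample data are hypothetical.

```python
def find_user_ids(search_response, username):
    """Return the _id of every hit whose username matches exactly."""
    hits = search_response.get("hits", {}).get("hits", [])
    return [
        hit["_id"]
        for hit in hits
        if hit.get("_source", {}).get("username") == username
    ]


if __name__ == "__main__":
    # Hypothetical response, trimmed to the fields the function reads.
    sample = {
        "hits": {
            "hits": [
                {"_id": "id-1", "_source": {"username": "alice@example.com"}},
                {"_id": "id-2", "_source": {"username": "alice2@example.com"}},
            ]
        }
    }
    print(find_user_ids(sample, "alice@example.com"))
```

The exact comparison matters because a `match` query is a full-text query and may return more than one user.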
If you have forgotten your password and need to reset it for your user, you can do so in your local instance as follows:
1. Use docker exec to get a bash session in the OpenSearch container:
docker exec -it <container_id> bash
2. Write all users to a users.log file:
curl -X GET "https://localhost:9200/users/_search?pretty" --insecure -u <opensearch_user>:<opensearch_password> -H 'Content-Type: application/json' -d'
{
"query": {
"match_all": {}
}
}
' > users.log
Open the users.log file with less and search for the admin user. Once found, scroll down to the apikey section. This value is the API key of the admin user.
I jumped onto another server within the same VLAN as my Shuffle server, but these commands could be run on localhost too. We will create a new user and update the user's role to admin with the Shuffle API.
(Optional step): If you have multiple organizations, change the active org like this and repeat for each org:
curl 'https://<shuffle_server_ip>/api/v1/orgs/{org_id}/change' -H 'Authorization: Bearer {API_KEY}' --data-raw '{"org_id":"{org_id}"}'
Create a new user
curl https://<shuffle_server_ip>/api/v1/users/register -H "Authorization: Bearer APIKEY" -d '{"username": "username", "password": "P@ssw0rd"}'
curl https://<shuffle_server_ip>/api/v1/users/getusers -H "Authorization: Bearer APIKEY"
curl https://<shuffle_server_ip>/api/v1/users/updateuser -H "Authorization: Bearer APIKEY" -d '{"user_id": "USERID", "role": "admin"}' -X PUT
curl -X GET "https://localhost:9200/users/_search?pretty" -u <opensearch_user>:<opensearch_password> --insecure -H 'Content-Type: application/json' -d' { "query": { "match": {"username": "myuser"} } }'
curl -X GET "https://localhost:9200/organizations/_search?pretty" -u <opensearch_user>:<opensearch_password> --insecure -H 'Content-Type: application/json' -d' { "size": 10000, "query": { "match_all": {}}}'
Find all org IDs
curl -X GET "https://localhost:9200/organizations/_search?pretty" -u <opensearch_user>:<opensearch_password> --insecure -H 'Content-Type: application/json' -d' { "size": 10000, "query": { "match_all": {}}}' | grep "\"id\" : \"" | sed 's/ *$//g' | sed 's/^[ \t]*//;s/[ \t]*$//' | uniq -u
This procedure can help you extract workflows directly from OpenSearch, even if the backend and frontend are not working properly.
Extract the index info from OpenSearch. NOTE: You may need to create a bind mount for the location where the workflows will be extracted to.
curl -X GET "https://localhost:9200/workflow/_search?pretty" -u admin:StrongShufflePassword321! -k -H 'Content-Type: application/json' -d' { "size": 10000, "query": { "match_all": {}}}' > workflows.json
Script to separate all workflows
import json
import os

# Load the raw OpenSearch search response
with open("workflows.json", "r") as tmp:
    data = json.load(tmp)

foldername = "./workflows_loaded"
os.makedirs(foldername, exist_ok=True)

# Raises KeyError if the response contains no hits (e.g. an error response)
for item in data["hits"]["hits"]:
    try:
        item = item["_source"]
    except KeyError:
        continue

    # One file per workflow, named after the workflow
    filename = f"""{foldername}/{item["name"]}.json"""
    print(f"Writing {filename}")
    with open(filename, "w+") as tmp:
        tmp.write(json.dumps(item))
This script needs to be run in the folder containing the file workflows.json. It will create a workflows_loaded directory with all the workflows in it.
This can also be very useful to either back up a copy of your work or export it from a lab to a prod instance.
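The script builds each file name directly from the workflow name, so a name containing characters such as "/" can make the write fail or land in an unexpected place. The following is a minimal sketch of one way to sanitize names first. The function `safe_filename` and its fallback name are hypothetical and not part of Shuffle.

```python
import re


def safe_filename(name, fallback="unnamed_workflow"):
    """Turn a workflow name into a file name that is safe on common filesystems."""
    # Replace every run of characters outside a conservative whitelist with "_".
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", name)
    # Drop leading/trailing dots and underscores (avoids names like ".." or "_").
    cleaned = cleaned.strip("._")
    return (cleaned or fallback) + ".json"


if __name__ == "__main__":
    for workflow_name in ("My/Workflow: v2", "///", "phishing-triage"):
        print(repr(workflow_name), "->", safe_filename(workflow_name))
```

Two different workflow names can map to the same file name after sanitizing, so appending the workflow ID may be worthwhile if your names are similar.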
If you lost an index due to corruption or other causes, there is no easy way to handle it. Here's a workaround we have for certain scenarios. What you'll need: access to another Shuffle instance, OR someone willing to share. As an example, let's rebuild the environments index. This assumes OpenSearch is on the same server.
curl -XDELETE http://localhost:9200/environments
curl -XPOST -H "Content-Type: application/json" "https://localhost:9200/environments/_doc" -u <opensearch_user>:<opensearch_password> --insecure -d '{"Name" : "Shuffle","Type" : "onprem","Registered" : false,"default" : true,"archived" : false,"id" : "26ae5c79-a6f3-4225-be18-39fa6018cdba","org_id" : "49eeb866-c8b4-4ea0-bc19-9e650e3bba9e"}'
Check the index
curl https://localhost:9200/environments/_search?pretty -u <opensearch_user>:<opensearch_password> --insecure
Once you have narrowed the problem down to OpenSearch consuming your system resources, you need to figure out why. It might be too many indices in OpenSearch, or a Java heap size problem.
docker exec -u0 -it "opensearch_ID" curl https://localhost:9200/_cat/indices?pretty -k -u admin:StrongShufflePassword321!
docker exec -u0 -it "opensearch_ID" curl https://localhost:9200/_cat/indices?pretty -k -u admin:StrongShufflePassword321! | grep -v security
docker exec -u0 -it "opensearch_ID" curl -X DELETE "https://localhost:9200/workflowexecution?pretty" -k -u admin:StrongShufflePassword321! -v
If you're on the default setup for Shuffle and you start to notice that your workflows are getting stuck, it might be because you're running out of the CPU needed to run the workflows.
To fix this, move towards setting up Shuffle for production readiness as described in our configuration documentation.
If you are doing this on a production server, you will have to comb through the indices and delete them manually according to your organisation's priorities (old executions and such).
You should notice a reduction in memory consumption; check this by running top. Run docker-compose down followed by docker-compose up -d for good measure, and you are good to go.
If the above steps do not fix the issue, it may be a Java heap size issue. Open your docker-compose.yml file, locate the opensearch configuration and find the OPENSEARCH_JAVA_OPTS setting. If it was initially set to 4 GB, halve it to 2 GB and save the file.
- "OPENSEARCH_JAVA_OPTS=-Xms2048m -Xmx2048m" # minimum and maximum Java heap size, recommend setting both to 50% of system RAM
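The 50% guideline from the comment above can be turned into a concrete value. This is a minimal sketch, assuming a Linux host where `os.sysconf` reports physical memory. The function name `heap_opts` is hypothetical, and the result is only a starting point, not a tuned setting.

```python
import os


def heap_opts(total_ram_mb, fraction=0.5):
    """Build an OPENSEARCH_JAVA_OPTS line using a fraction of total RAM for the heap."""
    heap_mb = int(total_ram_mb * fraction)
    # Minimum (-Xms) and maximum (-Xmx) are set to the same value.
    return f"OPENSEARCH_JAVA_OPTS=-Xms{heap_mb}m -Xmx{heap_mb}m"


if __name__ == "__main__":
    # Total physical memory in MB (Linux: page size times number of pages).
    total_mb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") // (1024 * 1024)
    print("Total RAM (MB):", total_mb)
    print(heap_opts(total_mb))
```

On a server that also runs the rest of the Shuffle stack, a smaller fraction may be more appropriate, since the other containers need memory too.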
In certain cases, you may experience OpenSearch continuously restarting. There are a few reasons for this, which should be checked in the following order (all of them can be spotted in the logs):
[...] vm.max_map_count=262144 setting?
[...] (1000:1000 by default)?
[...] (1000:1000) working?
[...] .env file?
In certain cases, especially when you're running in swarm mode (make sure ports 2377, 7946 and 4789 are open between your machines internally), you may experience timeouts or EOFs, or in other cases a TLS timeout error or a similar network request issue. This is most likely due to the network configuration of your Shuffle instances not matching the server it's running on.
The main setting involved is the MTU (Maximum Transmission Unit). The host's MTU has to match exactly with both the Docker bridge network and the shuffle_swarm_executions network.
Find the MTU of your preferred network interface:
ip addr | grep mtu
It is usually the network interface in the second line. Note its MTU.
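If there are many interfaces, picking the value out of the `ip addr` output by eye gets error-prone. Here is a minimal sketch that parses that output into a mapping of interface name to MTU. The function name `parse_mtus` and the sample output are illustrative, and the parsing assumes the usual `N: name: <FLAGS> mtu VALUE ...` header line format.

```python
import re

# Matches interface header lines such as:
#   2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP ...
# Names like "veth1@if5" are reduced to "veth1".
_HEADER = re.compile(r"^\d+:\s+([^:@\s]+)[^:]*:.*\bmtu\s+(\d+)", re.MULTILINE)


def parse_mtus(ip_addr_output):
    """Return {interface_name: mtu} parsed from the output of `ip addr`."""
    return {name: int(mtu) for name, mtu in _HEADER.findall(ip_addr_output)}


if __name__ == "__main__":
    sample = (
        "1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN\n"
        "    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00\n"
        "2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP\n"
        "    link/ether 42:01:0a:00:00:02 brd ff:ff:ff:ff:ff:ff\n"
    )
    for name, mtu in parse_mtus(sample).items():
        print(name, mtu)
```

To use it on a live host, the output of `ip addr` can be passed in, for example via `subprocess.run(["ip", "addr"], capture_output=True, text=True).stdout`.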
To set the MTU in Docker, do it in the docker-compose, in the networking section. Say the MTU you found was 1460, then use 1460, as can be seen below.
networks:
  shuffle:
    driver: bridge
    # uncomment to set MTU for swarm mode.
    # MTU should be whatever your host's preferred MTU is.
    # Refer to this doc to figure out what your host's MTU is:
    # https://shuffler.io/docs/troubleshooting#TLS_timeout_error/Timeout_Errors/EOF_Errors
    driver_opts: # removed comment from here
      com.docker.network.driver.mtu: 1460 # removed comment from here.
Next, if you're running in swarm mode, delete the existing shuffle_swarm_executions network if it already exists. You can do that by using:
sudo docker network rm shuffle_swarm_executions
This might be an essential step to enforce what we did in the last step. Shuffle does a lot of things under the hood, and syncing up the right interfaces is one of them, so that you don't have to worry about it.
If it requires removing dependent services, proceed to do that.
When done, restart the docker-compose. Now the issue should be automatically taken care of. If not, and you're in swarm mode, proceed to the next step of manually setting the network MTU:
We need to make a network named the same as the environment SHUFFLE_SWARM_NETWORK_NAME for Orborus (default: shuffle_swarm_executions):
docker network create --driver=overlay --ingress=false --attachable=true -o "com.docker.network.driver.mtu"="1460" shuffle_swarm_executions
If the issue still persists, look into changing the environment variable SHUFFLE_SWARM_BRIDGE_DEFAULT_INTERFACE. Shuffle takes care of syncing the docker0 bridge interface to the preferred interface of the container. Changing this value might help Docker sync things up better. The interface name is assumed to be "eth0" by default.
If none of this works, it's often simply because of the virtualisation used by your cloud provider. For example, we have found these issues to be persistent with providers using VMware underneath. Refer to this for a fix
ARM is supported on Shuffle since 1.3.0!
In certain scenarios, permissions inside and outside a container may be different. This has a lot of causes, and we'll try to help figure them out below. Thankfully most fixes are relatively simple. To test this try to go to /admin?tab=files in Shuffle, and upload a file. If the file is uploaded and it says status "active", all is good. If it's not being uploaded, then it's most likely a permission issue.
In the docker-compose.yml file, find the "shuffle-files" volume mounted for the backend service. Simply add a ":z" on the end of it like so:
- ${SHUFFLE_FILE_LOCATION}:/shuffle-files:z
Then restart the docker-compose (down & up -d), and try to upload a file again.
Disable Selinux to test. This should take immediate effect (run as root).
setenforce 0
After, try to upload a file again
To find the folder permissions inside the container
docker exec -u 0 shuffle-backend ls -la /
In certain scenarios or environments, you may find the docker socket to not have the right permissions. To work around this, we've built support for the docker socket proxy, which will give the containers the same permissions. Another good reason to use the docker socket proxy is to control the docker permissions required.
To use the docker socket proxy, add the following to your docker-compose.yml as a service:
docker-socket-proxy:
  image: tecnativa/docker-socket-proxy
  privileged: true
  environment:
    - SERVICES=1
    - TASKS=1
    - NETWORKS=1
    - NODES=1
    - BUILD=1
    - IMAGES=1
    - GRPC=1
    - CONTAINERS=1
    - PLUGINS=1
    - SYSTEM=1
    - VOLUMES=1
    - INFO=1
    - DISTRIBUTION=1
    - POST=1
    - AUTH=1
    - SECRETS=1
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
  networks:
    - shuffle
When done, remove the "/var/run/docker.sock" volume from the backend and orborus services in the docker-compose. These containers should route their docker traffic through this proxy. To enable the docker rerouting, add this environment variable to both of them:
- DOCKER_HOST=tcp://docker-socket-proxy:2375
This will route all docker traffic through the docker-socket-proxy giving you granular access to each API.
PS: Adding :z to the end of the volume may fix this issue as well.
If the server Shuffle is running on is slow, it's likely due to the same constraints as on any other server. One of these is typically the culprit: CPU, RAM or disk space.
The normal reason this happens is too many processes running concurrently in Docker (too many containers). To look at ideal configurations, look at production readiness in our configuration documentation.
First we check CPU. This can be done using the "top" command.
top
The output near the top typically looks something like this. If the CPU usage is too high (see line three: '%Cpu(s): 11.0 us' means 11% is used by user processes, while '2.0 id' means only 2% is idle), you've most likely not configured Shuffle with an appropriate maximum number of containers, or have poor cleanup routines (set CLEANUP=true).
top - 20:14:37 up 27 days, 24 min, 2 users, load average: 17.88, 15.17, 13.21
Tasks: 244 total, 2 running, 241 sleeping, 0 stopped, 1 zombie
%Cpu(s): 11.0 us, 10.7 sy, 0.0 ni, 2.0 id, 74.8 wa, 0.0 hi, 1.5 si, 0.0 st
KiB Mem : 8008956 total, 142236 free, 7561784 used, 304936 buff/cache
KiB Swap: 8257532 total, 4550684 free, 3706848 used. 69760 avail Mem
Fix: Stop Docker containers and reduce the number that are allowed to run. If everything is TOO slow, reboot the server and stop all containers once it has started back up (note that docker stop has no --force flag):
docker stop $(docker ps -aq)
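The `%Cpu(s)` line carries more than the user-space figure: idle (`id`) and I/O wait (`wa`) are often more telling. The sketch below parses that line into its fields. The function name `parse_cpu_line` is hypothetical, and it assumes the English-locale `top` format shown in the sample output above.

```python
import re


def parse_cpu_line(line):
    """Parse a top '%Cpu(s)' line into {field: percentage}."""
    pairs = re.findall(r"([\d.]+)\s+(us|sy|ni|id|wa|hi|si|st)\b", line)
    return {field: float(value) for value, field in pairs}


if __name__ == "__main__":
    line = "%Cpu(s): 11.0 us, 10.7 sy, 0.0 ni, 2.0 id, 74.8 wa, 0.0 hi, 1.5 si, 0.0 st"
    cpu = parse_cpu_line(line)
    print("idle:", cpu["id"], "iowait:", cpu["wa"], "user:", cpu["us"])
```

In the sample output, the CPU is almost never idle and most of the time goes to I/O wait, which points at disk or swap pressure rather than pure computation.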
Next up is RAM. This can also be checked using the "top" command.
top
As with CPU, the information is near the top of your screen and looks something like this. Check whether the RAM usage is too high (see line four: 'KiB Mem : 8008956 total, 142236 free' means that almost no memory is left on the device). This is a typical problem if you've enabled app log forwarding into Shuffle. To disable log forwarding, add the environment variable "SHUFFLE_LOGS_DISABLED=true" to Orborus, then bring it down and back up again.
top - 20:14:37 up 27 days, 24 min, 2 users, load average: 17.88, 15.17, 13.21
Tasks: 244 total, 2 running, 241 sleeping, 0 stopped, 1 zombie
%Cpu(s): 11.0 us, 10.7 sy, 0.0 ni, 2.0 id, 74.8 wa, 0.0 hi, 1.5 si, 0.0 st
KiB Mem : 8008956 total, 142236 free, 7561784 used, 304936 buff/cache
KiB Swap: 8257532 total, 4550684 free, 3706848 used. 69760 avail Mem
Fix: Stop Docker containers and reduce the number that are allowed to run. If everything is TOO slow, reboot the server and stop all containers once it has started back up (note that docker stop has no --force flag):
docker stop $(docker ps -aq)
Next up is disk space - can Shuffle save anything? See whether there is space on the machine in the location Shuffle is running
df -h
To get more space, either delete some files, clean up the Opensearch instance or add more disk space.
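The same check can be scripted, for instance to alert before the disk fills up. This is a minimal sketch using the standard library. The function name `free_space_gb` and the 10 GB threshold are hypothetical choices, not Shuffle recommendations.

```python
import shutil


def free_space_gb(path="."):
    """Return the free disk space, in GiB, of the filesystem containing path."""
    usage = shutil.disk_usage(path)
    return usage.free / 1024 ** 3


if __name__ == "__main__":
    # Run this from the folder where Shuffle is installed, or pass that path.
    free = free_space_gb(".")
    print(f"Free space: {free:.1f} GiB")
    if free < 10:  # arbitrary example threshold
        print("Low on disk space: clean up files or OpenSearch indices")
```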
Download the correct app version from Shuffle cloud.
Once downloaded, upload it to your onprem Shuffle instance by dragging and dropping it onto the activated app list.
Once done, check the server for the MISP images that are present.
You should see the previously existing images and the newly added app's image.
The last step is to point the target image to the source image that you uploaded.
You do this with the docker tag command (more information: https://docs.docker.com/engine/reference/commandline/tag/), i.e. docker tag source_image:{TAG} target_image:{TAG}. For example:
docker tag frikky/shuffle:misp_1.0.0 davvyshuffle/shuffle:MISP-e72b9e9c5b0a40753e184c8ce0ba6c2b
Go back to your Shuffle interface and your app should run successfully.
If you intend to upload the app to a remote server, you can push the app image to Docker Hub using the docker push command (more information: https://docs.docker.com/engine/reference/commandline/image_push/).
Sign up for Docker here (https://login.docker.com/u/login/), then push the intended image to your Docker Hub repository from your server's CLI. You might be prompted to enter your password; do so and your image will be uploaded.
From your remote server's CLI, pull the image from your Docker Hub repository. For more information about docker image pull, see https://docs.docker.com/engine/reference/commandline/pull/
Once this is done, tag the existing images of this app to the working app image you just pulled from your Docker Hub repository.
In certain cases, Docker may stop working because too many containers are running and Docker cannot keep up. The cause of this is typically Orborus starting too many workflows at once. To fix this, either reduce the number of containers able to run, or set up swarm mode (paid).
This can be controlled by the environment variables:
- SHUFFLE_ORBORUS_EXECUTION_TIMEOUT=600
- SHUFFLE_ORBORUS_EXECUTION_CONCURRENCY=10
- CLEANUP=true
Then manually clean up the containers:
service docker stop
rm -rf /var/lib/docker/containers/*
rm -rf /var/lib/docker/vfs/dir/*
service docker start
PS: You may need to use "systemctl stop docker" instead of using "service".
Now restart the Shuffle stack again, and all the containers should be gone
docker logs -f shuffle-orborus
docker service ls
Once you've ensured that all services have replicas created and are running, move on to the next step.
Check logs for the worker. First, run the below command to get a list of tasks that are running
docker ps
docker logs <name_of_worker>
You may have had problems with an app and need some help getting it fixed. Apps created in the app creator of Shuffle also generate underlying Python code, using the same capabilities as if you wrote a Python function from scratch. Here's how to find the code for a function.
docker images | grep -i defectdojo
docker run <imagename> cat app.py > app.py
def post_importscan_create(self,...)
Copy everything indented under this function and send it to support@shuffler.io for further help!
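Instead of copying the function by hand, it can be cut out of the extracted app.py with Python's `ast` module. This is a minimal sketch. The function name `extract_function` and the sample source are hypothetical, and `post_importscan_create` is just the example function name from above.

```python
import ast


def extract_function(source, name):
    """Return the source text of the first function called name, or None."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            # Requires Python 3.8+; returns the text from "def" to the end of the body.
            return ast.get_source_segment(source, node)
    return None


if __name__ == "__main__":
    sample = (
        "class App:\n"
        "    def post_importscan_create(self, url):\n"
        "        return url\n"
        "\n"
        "    def other(self):\n"
        "        pass\n"
    )
    print(extract_function(sample, "post_importscan_create"))
```

For a real app, read the file first, for example `extract_function(open("app.py").read(), "post_importscan_create")`.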
If you lose your tenants/suborgs for any reason and need to reinstate them, do the following:
Find the organization ID in the backend logs. IDs are in the UUID format 550e8400-e29b-41d4-a716-446655440000.
docker logs shuffle-backend
docker exec -it shuffle-opensearch bash
curl -k -u admin:StrongShufflePassword321! -H 'Content-Type: application/json' 'https://shuffle-opensearch:9200/organizations/_update/<org_id goes here>/' -d '{"doc": {"id": "org-id-goes-here", "name": "org-name-goes-here"}}'
Exit out of the container
Restart the Docker stack:
docker-compose down
docker-compose up -d
Go back to your shuffle instance and you should see the org in question reinstated in the tenants tab.
As the OpenSearch index may fill up over time, it is important to be able to debug the available indexes. One particular issue we have had is that it sometimes takes >60 seconds to load apps onprem. Here is how to resolve it.
Get into the OpenSearch container, as port 9200 is not exposed by default:
docker exec -u0 -it shuffle-opensearch bash
Get the indexes and look at them:
curl https://localhost:9200/_cat/indices?v -u admin:StrongShufflePassword321! -k
Find the largest items in the workflowapp index (for apps in shuffle):
curl https://localhost:9200/workflowapp/_search?v -u admin:StrongShufflePassword321! -k
Delete an index if it's too large (normal ones to delete if problems: workflowexecution, workflowqueue-shuffle, environment_stats)
curl -XDELETE https://localhost:9200/workflowqueue-shuffle -u admin:StrongShufflePassword321! -k
If you have lost access to Shuffle, it is usually due to an unforeseen disconnect from the database during startup, leading to more organizations being added. To fix this, your user needs to be re-added to the original organization.
Find the organization ID and user ID of your account. They are in the UUID format 550e8400-e29b-41d4-a716-446655440000
docker logs -f shuffle-backend
Get into the OpenSearch container, as port 9200 is not exposed by default:
docker exec -u0 -it shuffle-opensearch bash
Update the USERID and ORGID, ORGNAME fields, then run this command to re-add your account to the right org
curl -k -u admin:StrongShufflePassword321! https://localhost:9200/users/_update/USERID -d '{"doc": {"active_org.id": "ORGID", "active_org.name": "ORGNAME", "orgs": ["ORGID"]}}' -H "Content-Type: application/json"
You may want to install a custom module like pandas for use in 'execute python'. Note that 'execute python' isn't made for heavy processing; we recommend making a custom Python app for that.
There are two primary ways to do it: 1. Learn about dynamic library loading in your code (not recommended). 2. Build the Shuffle Tools app again locally and add the libraries you want to it! Library configuration is in this requirements.txt file
To build this app again, I would:
sudo docker build -t shuffle-tools:1.2.0 .